Abstract

Background: Phylogenetic tree comparison metrics are an important tool in the study of evolution, and hence 
the definition of such metrics is an interesting problem in phylogenetics. In a paper in Taxon fifty years ago, 
Sokal and Rohlf proposed to measure quantitatively the difference between a pair of phylogenetic trees by first 
encoding them by means of their half-matrices of cophenetic values, and then comparing these matrices. This 
idea has been used several times since then to define dissimilarity measures between phylogenetic trees but, to 
our knowledge, no proper metric on weighted phylogenetic trees with nested taxa based on this idea has been 
formally defined and studied yet. Actually, the cophenetic values of pairs of different taxa alone are not enough 
to single out phylogenetic trees with weighted arcs or nested taxa. 

Results: For every (rooted) phylogenetic tree $T$, let its cophenetic vector $\varphi(T)$ consist of all cophenetic values of pairs of taxa in $T$ together with the depths of all taxa in $T$. It turns out that these cophenetic vectors single out weighted phylogenetic trees with nested taxa. We then define a family of cophenetic metrics $d_{\varphi,p}$ by comparing these cophenetic vectors by means of $L^p$ norms, and we study, either analytically or numerically, some of their basic properties: neighbors, diameter, distribution, and their rank correlation with each other and with other metrics.

Conclusions: The cophenetic metrics can be safely used on weighted phylogenetic trees with nested taxa and no restriction on degrees, and they can be computed in $O(n^2)$ time, where $n$ stands for the number of taxa. The metrics $d_{\varphi,1}$ and $d_{\varphi,2}$ have positively skewed distributions, and they show a low rank correlation with the Robinson-Foulds metric and the nodal metrics, and a very high correlation with each other and with the splitted nodal metrics. The diameter of $d_{\varphi,p}$, for $p \geq 1$, is in $O(n^{(p+2)/p})$, and thus for low $p$ they are more discriminative, having a wider range of values.

Background 

Many phylogenetic trees published in the literature or included in phylogenetic databases are actually alternative phylogenies for the same sets of organisms, obtained from different datasets or using different evolutionary models or different phylogenetic reconstruction algorithms [17]. This variety of phylogenetic trees makes it necessary to develop methods for measuring their differences [11, Chapter 30]. The comparison of phylogenetic trees is also used to compare phylogenetic trees obtained through numerical algorithms with other types of hierarchical classifications [27, 32], to assess the stability of reconstruction methods [37], and in the comparative analysis of dendrograms and other hierarchical cluster structures [15, 24]. Hence, and since the safest way to quantify the differences between a pair of trees is through a metric, "tree comparison metrics are an important tool in the study of evolution" [34].

Many metrics for the comparison of phylogenetic trees have been proposed so far [11, Chapter 30]. Some of these metrics are edit distances that count how many operations of a given type are necessary to transform one tree into the other. These metrics include the nearest-neighbor interchange metric [35] and the subtree prune-and-regrafting distance [2]. Other metrics compare a pair of phylogenetic trees through some consensus subtree. This is the case, for instance, of the MAST distances defined in [12, 13, 39]. Finally, many metrics for phylogenetic trees are based on the comparison of encodings of the phylogenetic trees, like for instance the Robinson-Foulds metric [25, 26] (which can also be understood as an edit distance), the triples metric [7], the classical nodal metrics for binary phylogenetic trees [8, 9, 23, 34, 37], and the splitted nodal metrics for arbitrary phylogenetic trees [5]. The advantage of this last kind of metrics is that, unlike the edit and the consensus distances, they are usually computed in low polynomial time.



In a paper published fifty years ago [32], Sokal and Rohlf proposed a technique to compare dendrograms (which, in their paper, were equivalent to weighted phylogenetic trees without nested taxa) on the same set of taxa, by encoding them by means of their half-matrices of cophenetic values, and then comparing these structures. Their method runs as follows. To begin with, they divide the range of depths of internal nodes in the tree into a suitable number of equal intervals and number these intervals in increasing order. Then, for each pair of taxa $i,j$ in the tree, they compute their cophenetic value as the class mark of the interval where the depth of their lowest common ancestor lies. Then, to compare two phylogenetic trees, they compare their corresponding half-matrices of cophenetic values. In that paper, they do it specifically by calculating a correlation coefficient between their entries. Sokal and Rohlf's paper [32] is widely cited (612 citations according to Google Scholar on July 1, 2012) and their method has often been used to compare hierarchical classifications (see, for instance, [6, 19]).

Since Sokal and Rohlf's paper, other papers have compared the half-matrices of cophenetic values to define dissimilarity measures between phylogenetic trees (see, for instance, [16, 27]), and such half-matrices have also been used in the so-called "comparative method", the statistical methods used to make inferences on the evolution of a trait among species from the distribution of other traits: see [14, 22] and [11, Chapter 25]. But, to our knowledge, no proper metric for phylogenetic trees based on cophenetic values has been formally defined and studied in the literature. In this paper we define a new family of metrics for weighted phylogenetic trees with nested taxa based on Sokal and Rohlf's idea and we study some of their basic properties: neighbors, diameter, distribution, and their rank correlation with each other and with other metrics.

Our approach differs in some minor points from Sokal and Rohlf's. For instance, we use as the cophenetic value $\varphi(i,j)$ of a pair of taxa $i,j$ the actual depth of the lowest common ancestor of $i$ and $j$, instead of class marks, which Sokal and Rohlf used because of practical limitations. Moreover, instead of using a correlation coefficient, we define metrics by using $L^p$ norms. Finally, we do not restrict ourselves to dendrograms, which have no internal labeled nodes, but we also allow nested taxa.

There is, however, a main difference between our approach and Sokal and Rohlf's. We do not only consider the cophenetic values of pairs of taxa, but also the depths of the taxa. We must do so because we want to define a metric, where zero distance means isomorphism, and the cophenetic values of pairs of different taxa alone do not single out even the dendrograms considered by Sokal and Rohlf. That is, two non-isomorphic weighted phylogenetic trees without nested taxa on the same set of taxa can have the same vectors of cophenetic values; see Fig. 2.

It turns out that the cophenetic vector consisting of all cophenetic values of pairs of taxa and the depths of all taxa characterizes a weighted phylogenetic tree with nested taxa. This fact comes from the well-known relationship between cophenetic values and patristic distances. If we denote by $\delta(i)$ the depth of a taxon $i$, by $\varphi(i,j)$ the cophenetic value of a pair of taxa $i,j$ and by $d(i,j)$ the distance between $i$ and $j$, then [10]

$$d(i,j) = \delta(i) + \delta(j) - 2\varphi(i,j).$$

So, if the depths of the taxa are known, the knowledge of the cophenetic values of pairs of taxa is equivalent to the knowledge of the additive distance defined by the tree. In turn, the depths and the additive distance single out the unrooted semi-labeled weighted tree associated to the phylogenetic tree with the former root labeled with a specific label "root", and hence the phylogenetic tree itself: cf. Theorem 1.

The fact that cophenetic vectors single out weighted phylogenetic trees with nested taxa can also be deduced from their relationship with splitted path lengths [5]. Recall that the splitted path length $\ell(i,j)$ is the distance from the lowest common ancestor of $i$ and $j$ to $i$. It is known [5, Thm. 10] that the matrix $(\ell(i,j))_{i,j}$ of splitted path lengths characterizes a weighted phylogenetic tree with nested taxa. Since, obviously,

$$\ell(i,j) = \delta(i) - \varphi(i,j),$$

the cophenetic vector uniquely determines the matrix of splitted path lengths, and hence the tree.¹
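As an illustration, the following minimal Python sketch derives patristic distances and splitted path lengths from a cophenetic vector, assuming the identities $d(i,j) = \delta(i) + \delta(j) - 2\varphi(i,j)$ and $\ell(i,j) = \delta(i) - \varphi(i,j)$. The dictionary layout (keys `(i, j)` with `i <= j`, diagonal entries holding the depths), the function name `patristic_and_splitted` and the small example tree are assumptions made for this example only.

```python
def patristic_and_splitted(phi, taxa):
    """Derive patristic distances d and splitted path lengths ell from phi.

    phi maps (i, j) with i <= j to the cophenetic value of i and j;
    phi[(i, i)] is the depth of taxon i (hypothetical layout).
    """
    d, ell = {}, {}
    for i in taxa:
        for j in taxa:
            key = (min(i, j), max(i, j))
            # d(i, j) = depth(i) + depth(j) - 2 * phi(i, j)
            d[(i, j)] = phi[(i, i)] + phi[(j, j)] - 2 * phi[key]
            # ell(i, j) = depth(i) - phi(i, j): distance from the LCA down to i
            ell[(i, j)] = phi[(i, i)] - phi[key]
    return d, ell


# Unweighted example tree: the root has an internal child (parent of taxa 1, 2)
# and a leaf child (taxon 3).
phi = {(1, 1): 2, (2, 2): 2, (3, 3): 1, (1, 2): 1, (1, 3): 0, (2, 3): 0}
d, ell = patristic_and_splitted(phi, [1, 2, 3])
print(d[(1, 3)], ell[(1, 3)], ell[(3, 1)])
```

Note that the splitted path length is not symmetric: in this example the LCA of taxa 1 and 3 is the root, which lies at distance 2 from taxon 1 but at distance 1 from taxon 3.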



The vector of cophenetic values of pairs of different taxa is also related to the notion of ultrametric [18, 31]. Indeed, notice that $-\varphi$ satisfies the three-point condition of ultrametrics: for every taxa $i,j,k$,

$$-\varphi(i,j) \leq \max\{-\varphi(i,k), -\varphi(j,k)\}.$$

But $-\varphi$ is not an ultrametric, as $\varphi(i,i) = \delta(i) \neq 0$. Actually, $\varphi$ can only be used to define an ultrametric precisely on ultrametric trees, where the depths of all leaves are the same, say $\Delta$. In this case, $\Delta - \varphi$ is the ultrametric defined by the tree. In particular, ultrametric trees can be compared by comparing their vectors of cophenetic values of pairs of different taxa. A similar idea is used in [38] to induce an average genetic distance between populations from the average coancestry coefficient.

We would like to dedicate this paper to the memory of Robert R. Sokal, father of the field of numerical taxonomy, who passed away last April. His ideas permeate biostatistics and computational phylogenetics.

Notations 

A rooted tree is a directed finite graph that contains a distinguished node, called the root, from which every node can be reached through exactly one path. A weighted rooted tree is a pair $(T, \omega)$ consisting of a rooted tree $T = (V, E)$ and a weight function $\omega : E \to \mathbb{R}_{>0}$ that associates to every arc $e \in E$ a positive real number $\omega(e) > 0$. We identify every unweighted (that is, where no weight function has been explicitly defined) rooted tree $T$ with the weighted rooted tree $(T, \omega)$ with $\omega$ the constant function with value 1.

Let $T = (V, E)$ be a rooted tree. Whenever $(u, v) \in E$, we say that $v$ is a child of $u$ and that $u$ is the parent of $v$. Two nodes with the same parent are siblings. The nodes without children are the leaves of the tree, and the other nodes (including the root) are called internal. A pendant arc is an arc ending in a leaf. The nodes with exactly one child are called elementary. A tree is binary, or fully resolved, when every internal node has exactly two children.

¹ There are some details to be filled in here, because for technical reasons we shall allow the root of our phylogenetic trees to have out-degree 1 without being labeled, and this case is not covered by [5, Thm. 10], but it is not difficult to modify the argument given above to cover this case as well.

Whenever there exists a path from a node $u$ to a node $v$, we shall say that $v$ is a descendant of $u$ and also that $u$ is an ancestor of $v$, and we shall denote it by $v \preceq u$; if, moreover, $u \neq v$, we shall write $v \prec u$. The lowest common ancestor (LCA) of a pair of nodes $u, v$ of a rooted tree $T$, in symbols $\mathrm{LCA}_T(u,v)$, is the unique common ancestor of them that is a descendant of every other common ancestor of them. Given a node $v$ of a rooted tree $T$, the subtree of $T$ rooted at $v$ is the subgraph of $T$ induced on the set of descendants of $v$ (including $v$ itself). A rooted subtree is a cherry when it has 2 leaves, a triplet when it has 3 leaves, and a quartet when it has 4 leaves.

The distance from a node $u$ to a descendant $v$ of it in a weighted rooted tree $T$ is the sum of the weights of the arcs in the unique path from $u$ to $v$. In an unweighted rooted tree, this distance is simply the number of arcs in this path. The depth of a node $v$, in symbols $\delta_T(v)$, is the distance from the root to $v$.

Let $S$ be a non-empty finite set of labels, or taxa. A (weighted) phylogenetic tree on $S$ is a (weighted) rooted tree with some of its nodes bijectively labeled in the set $S$, including all its leaves and all its elementary nodes except possibly the root (which can be elementary but unlabeled). The reasons why we allow unlabeled elementary roots are that our results are still valid for phylogenetic trees containing them, and that even if we forbade them, we would need in some proofs to use that Theorem 1 below is true for phylogenetic trees containing them. Moreover, it is not uncommon to add an unlabeled elementary root to a phylogenetic tree in some contexts: see, for instance, the phylogenetic trees depicted in Wikipedia's entry "Phylogenetic tree".²

In a phylogenetic tree, we shall always identify a labeled node with its taxon. The internal labeled nodes of a phylogenetic tree are called nested taxa. Notice in particular that a phylogenetic tree without nested taxa cannot have elementary nodes other than the root. Although in practice $S$ may be any set of taxa, to fix ideas we shall usually take $S = \{1, \ldots, n\}$, with $n$ the number of labeled nodes of the tree, and we shall use the term phylogenetic tree with $n$ taxa to refer to a phylogenetic tree on this set. In general, we shall denote by $L(T)$ the set of taxa of a phylogenetic tree $T$.

Given a set $S$ of taxa, we shall consider the following spaces of phylogenetic trees:

• $\mathcal{WT}(S)$, of all weighted phylogenetic trees on $S$

• $\mathcal{UT}(S)$, of all unweighted phylogenetic trees on $S$

• $\mathcal{T}(S)$, of all unweighted phylogenetic trees on $S$ without nested taxa

• $\mathcal{BT}(S)$, of all binary unweighted phylogenetic trees on $S$ without nested taxa

When $S = \{1, \ldots, n\}$, we shall simply write $\mathcal{WT}_n$, $\mathcal{UT}_n$, $\mathcal{T}_n$, and $\mathcal{BT}_n$, respectively.

² http://en.wikipedia.org/wiki/Phylogenetic_tree

Two phylogenetic trees $T$ and $T'$ on the same set $S$ of taxa are isomorphic when they are isomorphic as directed graphs and the isomorphism sends each labeled node of $T$ to the labeled node with the same label in $T'$. An isomorphism of weighted phylogenetic trees is also required to preserve arc weights. We shall abuse notation by saying that two isomorphic trees are actually the same, and hence we shall denote that two trees $T, T'$ are isomorphic by simply writing $T = T'$.

Methods 
Cophenetic vectors 

Let $S$ be henceforth a non-empty set of taxa with $|S| = n$, which without any loss of generality we identify with $\{1, \ldots, n\}$. Let $T \in \mathcal{WT}_n$ be a weighted phylogenetic tree on $S$. For every pair of different taxa $i,j$ in $T$, their cophenetic value is the depth of their LCA:

$$\varphi_T(i,j) = \delta_T(\mathrm{LCA}_T(i,j)).$$

To simplify the notations, we shall often write $\varphi_T(i,i)$ to denote the depth $\delta_T(i)$ of a taxon $i$. The cophenetic vector of $T$ is

$$\varphi(T) = \big(\varphi_T(i,j)\big)_{1 \leq i \leq j \leq n} \in \mathbb{R}^{n(n+1)/2},$$

with its elements lexicographically ordered in $(i,j)$.

Example 1. If $T$ is the unweighted phylogenetic tree in Fig. 1, then $\varphi(T)$ is the vector obtained by lexicographically ordering in $(i,j)$ the elements of Table 1.




Figure 1: An unweighted phylogenetic tree on 7 taxa.

Table 1: Cophenetic values of the pairs of taxa in the phylogenetic tree $T$ in Fig. 1.

The cophenetic vectors single out weighted phylogenetic trees with nested taxa. 
Theorem 1. For every $T, T' \in \mathcal{WT}(S)$, if $\varphi(T) = \varphi(T')$, then $T = T'$.

Proof. Let $r$ be a symbol not belonging to $S$ and let $X = S \cup \{r\}$. Recall that a weighted $X$-tree is an undirected weighted tree $T$ with set of nodes $V$ endowed with a (not necessarily injective) node-labeling mapping $f : X \to V$ such that $f(X)$ contains all the leaves and all the degree-2 nodes in $T$ [29].

For every $T \in \mathcal{WT}(S)$, let $T^*$ be the weighted $X$-tree obtained by considering $T$ as undirected and adding to its former root the label $r$. Then, the distance $d_{T^*}$ on $T^*$ between pairs of labels in $X$ is uniquely determined by $\varphi(T)$ in the following way:

$$d_{T^*}(i, r) = \delta_T(i) \quad \text{for every } i \in S,$$
$$d_{T^*}(i, j) = \delta_T(i) + \delta_T(j) - 2\varphi_T(i,j) \quad \text{for every } i, j \in S.$$

Now, $T^*$ is singled out by $d_{T^*}$ [29, Thm. 7.1.8]. Since $T$ is uniquely determined from $T^*$ and the knowledge of the root (that is, the node labeled with $r$), we deduce that $\varphi(T)$ singles out $T$. □

This result implies that the vectors of cophenetic values of pairs of different taxa single out unweighted 
phylogenetic trees without nested taxa. 

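Corollary 1. For every $T \in \mathcal{T}_n$, let $\bar{\varphi}(T) = \big(\varphi_T(i,j)\big)_{1 \leq i < j \leq n} \in \mathbb{R}^{n(n-1)/2}$, with its elements lexicographically ordered in $(i,j)$. Then, for every $T, T' \in \mathcal{T}_n$, if $\bar{\varphi}(T) = \bar{\varphi}(T')$, then $T = T'$.

Proof. If $T$ is unweighted and without nested taxa, then, for every taxon $i$,

$$\delta_T(i) = 1 + \max\{\varphi_T(i,j) \mid 1 \leq j \leq n,\ j \neq i\},$$

and therefore, in this case, $\varphi(T)$ is uniquely determined by $\bar{\varphi}(T)$. □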
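The step used in this proof can be sketched in a few lines of Python: for an unweighted tree without nested taxa, the depth of each taxon is recovered as one plus the largest cophenetic value it shares with another taxon. The function name `depths_from_pairs` and the dictionary layout (keys `(i, j)` with `i < j`) are assumptions of this sketch, which only applies to trees with at least two taxa in that restricted space.

```python
def depths_from_pairs(phi_bar, n):
    """Recover taxon depths from cophenetic values of pairs of different taxa.

    Only valid for unweighted trees without nested taxa: the parent of a
    leaf i is then the LCA of i with some other taxon j, and it is the
    deepest of all such LCAs.
    """
    depths = {}
    for i in range(1, n + 1):
        depths[i] = 1 + max(
            phi_bar[(min(i, j), max(i, j))]
            for j in range(1, n + 1)
            if j != i
        )
    return depths


# Example: taxa 1 and 2 form a cherry, taxon 3 hangs from the root.
phi_bar = {(1, 2): 1, (1, 3): 0, (2, 3): 0}
depths = depths_from_pairs(phi_bar, 3)
print(depths)
```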

But in order to single out phylogenetic trees with non-constant weights in the arcs or with nested taxa, it is necessary to take into account also the depths of the leaves. For example, there is no way to reconstruct from $\bar{\varphi}(T)$ the weights of the pendant arcs: the depths of the leaves are needed. Likewise, without being able to compare depths with cophenetic values, there is no way to say whether a taxon is nested or not. More specifically, the three trees in Fig. 2 have the same value of $\varphi(1,2)$, and hence the same vector $\bar{\varphi}(T)$ of cophenetic values of pairs of different taxa, but they are not isomorphic as weighted phylogenetic trees.

Figure 2: Three non-isomorphic trees with the same vector $\bar{\varphi}(T)$ of cophenetic values of pairs of different taxa.

The cophenetic vector $\varphi(T)$ of a weighted phylogenetic tree $T \in \mathcal{WT}_n$ can be computed in optimal $O(n^2)$ time (assuming a constant cost for the addition of real numbers) by computing, for each internal node $v$, its depth $\delta_T(v)$ through a preorder traversal of $T$, and the pairs of taxa of which $v$ is the LCA through a postorder traversal of the tree. Both preorder and postorder traversals are performed in linear time on the usual tree data structures.
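The two traversals described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the tree encoding (a dictionary `children` mapping each node to a list of `(child, weight)` pairs), the function name `cophenetic_vector` and the example tree are all assumptions. Taxa are integers and unlabeled nodes are strings, so that pairs of taxa can be ordered.

```python
def cophenetic_vector(children, root, taxa):
    """Return {(i, j): phi(i, j)} for all taxa i <= j.

    children: node -> list of (child, arc weight); taxa: set of labeled nodes.
    """
    # Preorder pass: depth of every node.
    depth = {root: 0.0}
    order = []
    stack = [root]
    while stack:
        v = stack.pop()
        order.append(v)
        for c, w in children.get(v, []):
            depth[c] = depth[v] + w
            stack.append(c)

    # Reversed preorder visits every node after all its descendants.
    phi = {}
    below = {}  # node -> list of taxa in the subtree rooted at that node
    for v in reversed(order):
        groups = [below[c] for c, _ in children.get(v, [])]
        if v in taxa:
            groups.append([v])        # a nested taxon is the LCA of itself
            phi[(v, v)] = depth[v]    # and of each of its descendant taxa
        seen = []
        for g in groups:
            # Taxa coming from different groups have v as their LCA.
            for i in g:
                for j in seen:
                    phi[(min(i, j), max(i, j))] = depth[v]
            seen.extend(g)
        below[v] = seen
    return phi


# Unweighted example: root "r" with children "a" and taxon 3; "a" has taxa 1, 2.
children = {"r": [("a", 1), (3, 1)], "a": [(1, 1), (2, 1)]}
phi = cophenetic_vector(children, "r", {1, 2, 3})
print(sorted(phi.items()))
```

Each pair of taxa is assigned exactly once, at its LCA, which is what keeps the pair enumeration quadratic in the number of taxa.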

Cophenetic metrics 

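As we have seen in Theorem 1, the mapping

$$\varphi : \mathcal{WT}_n \to \mathbb{R}^{n(n+1)/2}$$

that sends each $T \in \mathcal{WT}_n$ to its cophenetic vector $\varphi(T)$ is injective up to isomorphism. As is well known, this allows one to induce metrics on $\mathcal{WT}_n$ from metrics defined on powers of $\mathbb{R}$. In particular, every $L^p$ norm $\|\cdot\|_p$ on $\mathbb{R}^{n(n+1)/2}$, $p \geq 1$, induces a cophenetic metric $d_{\varphi,p}$ on $\mathcal{WT}_n$ by means of

$$d_{\varphi,p}(T_1, T_2) = \|\varphi(T_1) - \varphi(T_2)\|_p, \quad T_1, T_2 \in \mathcal{WT}_n.$$

Recall that

$$\|(x_1, \ldots, x_m)\|_p = \sqrt[p]{|x_1|^p + \cdots + |x_m|^p},$$

and so, for instance,

$$d_{\varphi,1}(T_1,T_2) = \sum_{1 \leq i \leq j \leq n} |\varphi_{T_1}(i,j) - \varphi_{T_2}(i,j)|, \qquad d_{\varphi,2}(T_1,T_2) = \sqrt{\sum_{1 \leq i \leq j \leq n} \big(\varphi_{T_1}(i,j) - \varphi_{T_2}(i,j)\big)^2}$$

are the cophenetic metrics on $\mathcal{WT}_n$ induced by the Manhattan and the Euclidean norms. One can also use Donoho's $L^0$ "norm" (which, actually, is not a proper norm)

$$\|(x_1, \ldots, x_m)\|_0 = \text{number of entries } x_i \text{ that are } \neq 0$$

to induce a metric $d_{\varphi,0}(T_1,T_2)$ on $\mathcal{WT}_n$, which turns out to be simply the Hamming distance between $\varphi(T_1)$ and $\varphi(T_2)$.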

As we have seen in the previous subsection, the cophenetic vector of a phylogenetic tree in $\mathcal{WT}_n$ can be computed in $O(n^2)$ time. For every $T_1, T_2 \in \mathcal{WT}_n$, and assuming a constant cost for the addition and product of real numbers, the cost of computing $d_{\varphi,0}(T_1,T_2)$ (as the number of non-zero entries of $\varphi(T_1) - \varphi(T_2)$) is $O(n^2)$, and the cost of computing $d_{\varphi,p}(T_1,T_2)^p$, for $p \geq 1$ (as the sum of the $p$-th powers of the entries of the difference $\varphi(T_1) - \varphi(T_2)$) is $O(n^2 + \log_2(p) n^2)$, which is again $O(n^2)$ if we understand $\log_2(p)$ as part of the constant factor. Finally, the cost of computing $d_{\varphi,p}(T_1,T_2)$, $p \geq 1$, as the $p$-th root of $d_{\varphi,p}(T_1,T_2)^p$ will depend on $p$ and on the accuracy with which this root is computed. Assuming a constant cost for the computation of $p$-th roots with a given accuracy (notice that, in practice, for low $p$ and accuracy, this step will be dominated by the computation of $d_{\varphi,p}(T_1,T_2)^p$), the total cost of computing $d_{\varphi,p}(T_1,T_2)$ is $O(n^2)$.
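Once the two cophenetic vectors are available, the distance itself is a single pass over their entries. The sketch below assumes that both vectors are stored as dictionaries with the same keys `(i, j)`, `i <= j`; the function name `cophenetic_distance` and the two example vectors are illustrative assumptions.

```python
def cophenetic_distance(phi1, phi2, p):
    """Cophenetic distance for p = 0 (Hamming) or p >= 1 (L^p norm).

    phi1 and phi2 must be indexed by the same pairs of taxa.
    """
    diffs = [abs(phi1[key] - phi2[key]) for key in phi1]
    if p == 0:
        # Number of entries in which the two vectors differ.
        return sum(1 for x in diffs if x != 0)
    # Sum of p-th powers followed by a p-th root.
    return sum(x ** p for x in diffs) ** (1.0 / p)


# Two unweighted trees on taxa 1, 2, 3: a cherry (1, 2) with 3 attached to
# the root, and the star tree in which all three taxa hang from the root.
phi_cherry = {(1, 1): 2, (2, 2): 2, (3, 3): 1, (1, 2): 1, (1, 3): 0, (2, 3): 0}
phi_star = {(1, 1): 1, (2, 2): 1, (3, 3): 1, (1, 2): 0, (1, 3): 0, (2, 3): 0}

d0 = cophenetic_distance(phi_cherry, phi_star, 0)
d1 = cophenetic_distance(phi_cherry, phi_star, 1)
d2 = cophenetic_distance(phi_cherry, phi_star, 2)
print(d0, d1, d2)
```

In this example the two vectors differ by 1 in exactly three entries, so the Hamming and Manhattan distances coincide and the Euclidean distance is their square root.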

The next examples show some features of these cophenetic metrics.

Example 2. Let $T \in \mathcal{UT}_n$, let $(u,v)$ be an arc of $T$ with $u$ or $v$ unlabeled, and let $T'$ be the phylogenetic tree in $\mathcal{UT}_n$ obtained by contracting $(u,v)$: that is, by removing the node $v$ and the arc $(u,v)$, labeling $u$ with the label of $v$ if it was labeled, and replacing every arc $(v,x)$ in $T$ by an arc $(u,x)$. Notice that, in the passage from $T$ to $T'$, for every $i,j \in S$:

• If both $i,j$ are descendants of $v$ in $T$, then $\varphi_{T'}(i,j) = \varphi_T(i,j) - 1$.

• In any other case, $\varphi_{T'}(i,j) = \varphi_T(i,j)$.

As a consequence,

$$\varphi_T(i,j) - \varphi_{T'}(i,j) = \begin{cases} 1 & \text{if } i,j \text{ are both descendants of } v \text{ in } T \\ 0 & \text{otherwise} \end{cases}$$

and therefore, if $n_v$ is the number of descendant taxa of $v$,

$$d_{\varphi,p}(T,T') = \begin{cases} \binom{n_v+1}{2} & \text{if } p = 0 \\ \sqrt[p]{\binom{n_v+1}{2}} & \text{if } p \geq 1. \end{cases}$$

So the contraction of an arc in a tree $T$ (which is Robinson-Foulds' $\alpha$-operation [26]) yields a new tree $T'$ at a cophenetic distance from $T$ that depends increasingly on the number of descendant taxa of the head of the contracted arc. □
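The effect of a contraction can be checked numerically on a small case. The sketch below is a brute-force illustration under several assumptions: unweighted trees are encoded as a hypothetical `parent` dictionary, LCAs are found by walking up to the root rather than by the quadratic-time traversal, and `contract` only covers the case where the head $v$ of the arc is unlabeled. It counts the entries of the cophenetic vector that change when an arc whose head has $n_v = 3$ descendant taxa is contracted, and compares that count with $n_v(n_v+1)/2$.

```python
def cophenetic(parent, taxa):
    """Brute-force cophenetic values of an unweighted tree.

    parent: node -> parent node (the root maps to None).
    """
    def path(v):
        # Nodes from v up to the root, v included.
        out = []
        while v is not None:
            out.append(v)
            v = parent[v]
        return out

    depth = {v: len(path(v)) - 1 for v in parent}
    phi = {}
    for i in taxa:
        ancestors_i = set(path(i))
        for j in taxa:
            if i <= j:
                # First ancestor of j that is also an ancestor of i.
                lca = next(a for a in path(j) if a in ancestors_i)
                phi[(i, j)] = depth[lca]
    return phi


def contract(parent, u, v):
    """Contract the arc (u, v) with v unlabeled: children of v move to u."""
    return {x: (u if q == v else q) for x, q in parent.items() if x != v}


# Hypothetical tree: root "r" with children "v" and taxon 5; "v" has taxa 1, 2, 3.
parent = {"r": None, "v": "r", 5: "r", 1: "v", 2: "v", 3: "v"}
taxa = [1, 2, 3, 5]
phi_before = cophenetic(parent, taxa)
phi_after = cophenetic(contract(parent, "r", "v"), taxa)
diffs = [abs(phi_before[k] - phi_after[k]) for k in phi_before]
n_v = 3
print(sum(diffs), n_v * (n_v + 1) // 2)
```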
Example 3. Let $T_0, T_0' \in \mathcal{WT}_m$, for some $m < n$, let $T \in \mathcal{WT}_n$ be such that its subtree rooted at some node $z$ is $T_0$, and let $T' \in \mathcal{WT}_n$ be the tree obtained by replacing in $T$ this subtree $T_0$ by $T_0'$.

Notice that, for every $i,j \in \{1, \ldots, n\}$, $\varphi_T(i,j) = \delta_T(z) + \varphi_{T_0}(i,j)$ if $i,j \leq m$, and $\varphi_T(i,j) = \varphi_T(z,j)$ if $i \leq m$ and $j > m$, and the same holds in $T'$, replacing $T$ and $T_0$ by $T'$ and $T_0'$, respectively. Since, moreover, $\delta_T(z) = \delta_{T'}(z)$, $\varphi_T(z,j) = \varphi_{T'}(z,j)$ for every $j > m$, and $\varphi_T(i,j) = \varphi_{T'}(i,j)$ for every $i,j > m$, we conclude that

$$\varphi_T(i,j) - \varphi_{T'}(i,j) = \begin{cases} \varphi_{T_0}(i,j) - \varphi_{T_0'}(i,j) & \text{if } i,j \leq m \\ 0 & \text{otherwise} \end{cases}$$

and hence

$$d_{\varphi,p}(T,T') = d_{\varphi,p}(T_0,T_0').$$

So, the cophenetic metrics are local, as other popular metrics like the Robinson-Foulds or the triples metrics, but unlike other popular metrics, like for instance the nodal metrics. □

Results

Minimum and maximum values for cophenetic metrics

Our first goal is to find the smallest non-zero value of $d_{\varphi,p}$ on several spaces of phylogenetic trees, and the pairs of trees at which it is reached. These pairs of trees at minimum distance can be understood as 'adjacent' in the corresponding metric space, and their characterization yields a first step towards understanding how cophenetic metrics measure the difference between two trees.

Notice that this problem makes no sense for weighted phylogenetic trees. For instance, if we add or subtract an $\varepsilon > 0$ to the weight of a pendant arc in a tree $T$, without changing its topology, the distance between $T$ and the resulting tree will be $\varepsilon$, which can be as small as desired. So, we only consider this problem on $\mathcal{UT}_n$, $\mathcal{T}_n$, and $\mathcal{BT}_n$.

In order to simplify the statements, set

$$D_p(T_1,T_2) = \begin{cases} d_{\varphi,0}(T_1,T_2) & \text{if } p = 0 \\ d_{\varphi,p}(T_1,T_2)^p & \text{if } p \geq 1. \end{cases}$$

The following easy result, which is a direct consequence of the fact that $D_p(T_1,T_2) \geq D_0(T_1,T_2)$ for every $p \geq 1$ and $T_1,T_2 \in \mathcal{UT}_n$, will be used in the proof of the next propositions.

Lemma 1. Assume that, for every pair of different trees $T_1,T_2$ in $\mathcal{UT}_n$, $\mathcal{T}_n$ or $\mathcal{BT}_n$ such that $D_0(T_1,T_2)$ is minimum on this space, we have that $D_p(T_1,T_2) = D_0(T_1,T_2)$. Then, the minimum non-zero value of $D_p$ on this space of trees is equal to the minimum non-zero value of $D_0$ on it, and it is reached at exactly the same pairs of trees. □

The least non-zero values of $D_p$, for $p \in \{0\} \cup [1,\infty[$, on $\mathcal{UT}_n$, $\mathcal{T}_n$, and $\mathcal{BT}_n$, together with an explicit description of the pairs of trees where these minimum values are reached, are given by the next three propositions. We give their proofs in the Appendix.

Proposition 1. The minimum non-zero value of $D_p$ on $\mathcal{UT}_n$, for $p \in \{0\} \cup [1,\infty[$ and $n \geq 2$, is 1. And for every $T,T' \in \mathcal{UT}_n$, $D_p(T,T') = 1$ if and only if one of them is obtained from the other by contracting a pendant arc. □

So, not every tree in $\mathcal{UT}_n$ has neighbors at cophenetic distance 1: only those trees with some leaf whose parent is unlabeled. Now, it is not difficult to check that a tree $T \in \mathcal{UT}_n$ such that all its leaves have labeled parents has some tree $T'$ such that $D_p(T,T') = 2$, which is the minimum value of $D_p$ on $\mathcal{UT}_n$ greater than 1. One such $T'$ is obtained by choosing a pendant arc in $T$ and interchanging the labels of its source and its target nodes.

Proposition 2. The minimum non-zero value of $D_p$ on $\mathcal{T}_n$, for $p \in \{0\} \cup [1,\infty[$ and $n \geq 3$, is 3. And for every $T,T' \in \mathcal{T}_n$, $D_p(T,T') = 3$ if and only if one of them is obtained from the other by means of one of the following two operations:

(a) Contracting an arc ending in the parent of a cherry (see Fig. 3)

(b) Pruning and regrafting a leaf that is a sibling of the root of a cherry, to make it a sibling of the leaves in the cherry (see Fig. 4) □



So, every tree $T \in \mathcal{T}_n$ has neighbors $T'$ such that $D_p(T,T') = 3$. Indeed, take an internal node $v$ in $T$ of largest depth, so that all its children are leaves. If $v$ has exactly two children, one such neighbor of $T$ is obtained by contracting the arc ending in $v$. If $v$ has more than two children, one such neighbor of $T$ is obtained by replacing any two children of $v$ by a cherry (that is, taking two children $i,j$ of $v$, removing the arcs $(v,i)$ and $(v,j)$, and then adding a new node $v_0$ and arcs $(v,v_0)$, $(v_0,i)$, and $(v_0,j)$).




Figure 3: Contraction of an arc ending in the parent of a cherry.

Figure 4: Pruning and regrafting an uncle of a cherry to make it a sibling of the leaves in the cherry.



Proposition 3. The minimum non-zero value of $D_p$ on $\mathcal{BT}_n$, for $p \in \{0\} \cup [1,\infty[$ and $n \geq 3$, is 4. And for every $T,T' \in \mathcal{BT}_n$, $D_p(T,T') = 4$ if and only if one of them is obtained from the other by means of one of the following operations:

(a) Reorganizing a triplet (see Fig. 5)

(b) Reorganizing a completely branched quartet (see Fig. 6) □

Figure 6: Reorganizing a completely branched quartet.

So again, every tree $T \in \mathcal{BT}_n$ has neighbors $T'$ such that $D_p(T,T') = 4$. Indeed, take an internal node $v$ in $T$ of largest depth, so that its two children are leaves. Let $w$ be the parent of $v$. Then, either the other child of $w$ is a leaf, in which case $w$ is the root of a triplet and reorganizing its taxa we obtain a neighbor of $T$, or the other child of $w$ is the parent of a cherry (it will have the same, maximum, depth as $v$), in which case $w$ is the root of a completely branched quartet and reorganizing its taxa we obtain a neighbor of $T$.

We focus now on the diameter, that is, the largest value of $d_{\varphi,p}$ on the spaces of unweighted phylogenetic trees (as in the case of the minimum non-zero value, and for the same reasons, the problem of finding the diameter makes no sense for weighted trees). Unfortunately, we have not been able to find exact formulas for it, but we have obtained its order, which we give in the next proposition. We also give its proof in the Appendix.

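Proposition 4. The diameter of $d_{\varphi,p}$ on $\mathcal{UT}_n$, $\mathcal{T}_n$, and $\mathcal{BT}_n$ is in $\Theta(n^2)$ if $p = 0$ and in $\Theta(n^{(p+2)/p})$ if $p \geq 1$.

In particular, the diameter of $d_{\varphi,1}$ on these spaces is in $\Theta(n^3)$, and the diameter of $d_{\varphi,2}$ is in $\Theta(n^2)$.

Numerical experiments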

We have performed several numerical experiments concerning the distributions of $d_{\varphi,1}$ and $d_{\varphi,2}$, and the correlation of these metrics with other phylogenetic tree comparison metrics. The results of all these experiments can be found in the Supplementary Material web page http://bioinfo.uib.es/~recerca/phylotrees/cophidist/. In this section we report only on some significant results obtained through these experiments.

As a first experiment, we have generated all trees in $\mathcal{BT}_n$ and $\mathcal{T}_n$, for $n = 3,4,5,6$, and for all pairs of them we have computed:

• The cophenetic distances $d_{\varphi,1}$ and $d_{\varphi,2}$ on $\mathcal{BT}_n$ and $\mathcal{T}_n$.

• The Robinson-Foulds distance $d_{RF}$ on $\mathcal{BT}_n$ and $\mathcal{T}_n$ [26].

• The classical nodal distances $d_{nodal,1}$ and $d_{nodal,2}$ on $\mathcal{BT}_n$, which compare the vectors of distances between pairs of taxa by means of the Manhattan and the Euclidean norms, respectively; see [37] and [9], respectively, as well as [5].

• The splitted nodal distances $d^{sp}_{nodal,1}$ and $d^{sp}_{nodal,2}$ on $\mathcal{T}_n$, which compare the matrices of splitted path lengths between pairs of taxa by means of the Manhattan and the Euclidean norms, respectively; see [5].

In order to analyze these data, we have plotted 2D-histograms for all pairs of metrics and we have computed their Spearman's rank correlation coefficient. On the one hand, the 2D-histograms for $\mathcal{BT}_6$ and $\mathcal{T}_6$ (the most significant case) are given in Figures 7 and 8, respectively. For each pair of distances, we have divided the range of values that each of the distances takes into 25 subranges, and computed how many pairs of trees fall into each of the $25 \times 25$ different possibilities. Each of these possibilities is represented by a rectangle in a grid, whose darkness level is proportional to the number of pairs of trees. On the other hand, the Spearman's rank correlation coefficients between the aforementioned distances in the most significant case of $n = 6$ are given in Tables 2 and 3.
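The two summaries used in this analysis can be sketched in plain Python. This is only an illustration of the standard definitions, not the code used for the experiments: the function names, the tie handling by average ranks, and the small lists of distance values are assumptions of this sketch, and `spearman` assumes that neither list is constant.

```python
def average_ranks(values):
    """1-based ranks of the values, tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def histogram_2d(xs, ys, bins=25):
    """Count the pairs (x, y) falling in each cell of a bins x bins grid."""
    def cell(v, lo, hi):
        if hi == lo:
            return 0
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)

    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    grid = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        grid[cell(x, x_lo, x_hi)][cell(y, y_lo, y_hi)] += 1
    return grid


# Hypothetical distance values for five pairs of trees under two metrics.
dist_a = [3, 4, 6, 8, 10]
dist_b = [1.7, 2.0, 2.4, 2.8, 3.2]
rho = spearman(dist_a, dist_b)
grid = histogram_2d(dist_a, dist_b)
print(rho, sum(map(sum, grid)))
```

Because the rank correlation only depends on the ordering of the values, two metrics that always rank pairs of trees in the same way have coefficient 1 even if their numerical ranges are very different.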
Figure 7: 2D-histograms showing the relationship between different distances on $\mathcal{BT}_6$.

Table 2: Spearman's rank correlation coefficient between different distances on $\mathcal{BT}_6$.

These histograms and tables show that $d_{\varphi,1}$ and $d_{\varphi,2}$ are highly correlated, and that each $d_{\varphi,i}$, $i = 1,2$, is highly correlated with the corresponding $d^{sp}_{nodal,i}$ on $\mathcal{T}_6$. This is not a surprise, because both types of metrics are based on encodings of phylogenetic trees related to the position in the tree of the LCA of every pair of leaves: remember the relationship between depths, cophenetic values and splitted path lengths recalled in the Background section. More surprising to us is the low correlation between each $d_{\varphi,i}$ and the corresponding $d_{nodal,i}$ on $\mathcal{BT}_6$, because of the relationship between depths, cophenetic values and patristic distances also recalled in the Background section. The very low correlation between the cophenetic metrics and the Robinson-Foulds metric simply shows that these metrics measure different notions of similarity.

Figure 8: 2D-histograms showing the relationship between different distances on $\mathcal{T}_6$.

Table 3: Spearman's rank correlation coefficient between different distances on $\mathcal{T}_6$.

Our second experiment is for values of $n$ greater than 6. The numbers of trees in each of the spaces $\mathcal{T}_n$ and $\mathcal{BT}_n$ make it unfeasible to compute the distances between all pairs of trees. Hence, we have randomly and uniformly generated pairs of trees in each of these spaces for $n = 10, 20, \ldots, 100$ until the approximated values of the Spearman's rank correlations of all pairs of distances converged up to 3 significant digits. The corresponding 2D-histograms and Spearman's rank correlation coefficient tables for the most significant case of $n = 100$ are shown in Figures 9 and 10 and Tables 4 and 5. These diagrams and tables confirm the very high correlation between $d_{\varphi,1}$ and $d_{\varphi,2}$, and the very low correlation of these metrics with the nodal and Robinson-Foulds metrics. The correlation between each $d_{\varphi,i}$, $i = 1,2$, and the corresponding $d^{sp}_{nodal,i}$ is still significant, but it decreases as $n$ increases.

Figure 9: 2D-histograms showing the relationship between different distances on $\mathcal{BT}_{100}$.

Table 4: Spearman's rank correlation coefficient between different distances on $\mathcal{BT}_{100}$.



Finally, in Figure 11 we have plotted the histograms of the distributions of $d_{\varphi,1}$ and $d_{\varphi,2}$ on $\mathcal{BT}_n$ and $\mathcal{T}_n$ for $n = 10, 20, \ldots, 100$. As can be seen, they are positively skewed, like the splitted nodal metrics [5, Fig. 5], but unlike other metrics like the Robinson-Foulds metric [33] or the transposition distance [1, Fig. 2], which are negatively skewed, or the triples metric [7], which is approximately normal.



Figure 10: 2D-histograms showing the relationship between different distances on $\mathcal{T}_{100}$.

Table 5: Spearman's rank correlation coefficients between different distances on $\mathcal{T}_{100}$.

                 d_{φ,2}     d_{nodal,1}   d_{nodal,2}   d_{RF}
  d_{φ,1}        0.987184    0.731755      0.753918      0.091556
  d_{φ,2}                    0.780030      0.803423      0.088390
  d_{nodal,1}                              0.990944      0.132030
  d_{nodal,2}                                            0.118336

Conclusions

Following a fifty-year-old idea of Sokal and Rohlf [32], we have encoded a weighted phylogenetic tree with nested taxa by means of its vector of cophenetic values of pairs of taxa, adding moreover to this vector the depths of single taxa. These positive real-valued vectors single out weighted phylogenetic trees with nested taxa, and therefore they can be used to define metrics to compare such trees. We have defined a family of metrics $d_{\varphi,p}$, for $p \in \{0\} \cup [1,\infty[$, by comparing these vectors through the $L^p$ norm.
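The construction summarized above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: a tree is stored as a hypothetical child-to-parent dictionary, `weight[v]` is the weight of the arc ending in `v` (1 if absent), and the names `cophenetic_vector` and `cophenetic_distance` are made up for the example. The entry for a pair $(i,i)$ is the depth of taxon $i$.

```python
from itertools import combinations_with_replacement


def ancestors(parent, node):
    """Path from node up to the root (node first, root last)."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def cophenetic_vector(parent, taxa, weight=None):
    """Cophenetic values for all pairs i <= j of taxa; (i, i) gives the depth of i."""
    weight = weight or {}

    def depth(node):
        # Sum of the weights of the arcs on the path from the root to node.
        return sum(weight.get(v, 1.0) for v in ancestors(parent, node)[:-1])

    vec = []
    for i, j in combinations_with_replacement(sorted(taxa), 2):
        anc_j = set(ancestors(parent, j))
        lca = next(v for v in ancestors(parent, i) if v in anc_j)
        vec.append(depth(lca))
    return vec


def cophenetic_distance(vec1, vec2, p):
    """L^p distance between two cophenetic vectors; p = 0 is the Hamming distance."""
    diffs = [abs(a - b) for a, b in zip(vec1, vec2)]
    if p == 0:
        return sum(1 for d in diffs if d != 0)
    return sum(d ** p for d in diffs) ** (1.0 / p)


# Two weighted trees with the shape ((1,2),3) that differ only in the weight
# of the arc ending in the parent "a" of the cherry (0.5 versus 1.5).
shape = {"r": None, "a": "r", 1: "a", 2: "a", 3: "r"}
vec_1 = cophenetic_vector(shape, [1, 2, 3], {"a": 0.5, 3: 2.0})
vec_2 = cophenetic_vector(shape, [1, 2, 3], {"a": 1.5, 3: 2.0})
print(cophenetic_distance(vec_1, vec_2, 1), cophenetic_distance(vec_1, vec_2, 2))
```

In this toy example the two vectors differ by exactly 1 in three entries (the cophenetic value of the pair (1,2) and the depths of 1 and 2), so the Manhattan distance is 3 and the Euclidean distance is $\sqrt{3}$.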

We cannot advocate the use of any cophenetic metric $d_{\varphi,p}$ over the others except, perhaps, to warn against the use of the Hamming distance $d_{\varphi,0}$, because it is too uninformative. Since the most popular norms on $\mathbb{R}^m$ are the Manhattan norm $L^1$ and the Euclidean norm $L^2$, it seems natural to use $d_{\varphi,1}$ or $d_{\varphi,2}$. And since these two metrics are very highly correlated, the comparison of trees using one or the other will not differ greatly. Each one of these metrics has its own advantages.

Figure 11: Histograms of the distributions of $d_{\varphi,1}$ and $d_{\varphi,2}$ on $\mathcal{T}_n$ and $\mathcal{BT}_n$ for $n = 10, 20, \ldots, 100$.

On the one hand, the computation of $d_{\varphi,1}$ does not involve roots, and therefore it can be computed exactly. Moreover, it takes integer values on unweighted trees and in this case its range of values is greater, thus being more discriminative. Actually, since $\|x\|_p \leq \|x\|_1$ for every $x \in \mathbb{R}^m$ and $p \geq 1$, we have that

$$d_{\varphi,p}(T_1,T_2) \leq d_{\varphi,1}(T_1,T_2) \quad \text{for every } T_1, T_2 \in \mathcal{WT}_n.$$
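The norm inequality used here can be checked numerically on any vector. The snippet below is only a sanity check on one arbitrarily chosen vector, with a hypothetical helper `lp_norm`; it is not a proof.

```python
def lp_norm(x, p):
    """L^p norm of a vector for p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


x = [3.0, -4.0, 1.0]          # arbitrary example vector
norm_1 = lp_norm(x, 1)        # Manhattan norm
for p in (1, 1.5, 2, 3, 10):
    # Every L^p norm with p >= 1 is bounded above by the L^1 norm.
    assert lp_norm(x, p) <= norm_1 + 1e-12
```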

On the other hand, the comparison of cophenetic vectors by means of the Euclidean norm enables the use of many geometric and clustering methods that are not available otherwise. In particular, it is possible to compute the mean value of the square of $d_{\varphi,2}$ under different evolutionary models. We shall report on this elsewhere.

As a rule of thumb, and as we already advised in the context of splitted nodal metrics [5], we suggest using $d_{\varphi,1}$ when the trees are unweighted, because these trees can be seen as discrete objects and thus their comparison through a discrete tool such as the Manhattan norm seems appropriate. When the trees have arbitrary positive real weights, they should be understood as belonging to a continuous space, and then the Euclidean norm is more appropriate.

Future work will include a deeper study of the distribution of $d_{\varphi,1}$ and $d_{\varphi,2}$ on different spaces of unweighted phylogenetic trees.
Appendix: Proofs of Propositions 1-4

Proof of Proposition 1

By Lemma 1, it is enough to prove that the minimum non-zero value of $D_0$ is 1, and that all pairs $T, T' \in \mathcal{UT}_n$ such that $D_0(T,T') = 1$ also satisfy that $D_p(T,T') = 1$ for every $p \geq 1$.

As we have seen in Example 2, if we contract a pendant arc in a tree $T$, we obtain a new tree $T'$ such that $D_p(T,T') = 1$ for every $p \in \{0\} \cup [1,\infty[$, and this is of course the smallest possible non-zero value of $D_p$ on $\mathcal{UT}_n$. It remains to prove that this is the only way we can obtain a pair of trees such that $D_0(T,T') = 1$.
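A small unweighted example makes the pendant arc contraction concrete. The sketch below is illustrative only: trees are hypothetical child-to-parent dictionaries, and $D_p$ is assumed to be the number of differing entries for $p = 0$ and the sum of $p$-th powers of the absolute differences for $p \geq 1$, an assumption chosen because it is consistent with the value 1 for every $p$ stated above.

```python
from itertools import combinations_with_replacement


def path_to_root(parent, node):
    """Nodes on the path from node up to the root."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def cophenetic_vector(parent, taxa):
    """Depths of the lowest common ancestors of all pairs i <= j of taxa."""
    vec = []
    for i, j in combinations_with_replacement(sorted(taxa), 2):
        up_j = set(path_to_root(parent, j))
        lca = next(v for v in path_to_root(parent, i) if v in up_j)
        vec.append(len(path_to_root(parent, lca)) - 1)
    return vec


def D(vec1, vec2, p):
    """Assumed definition: Hamming count for p = 0, sum of p-th powers otherwise."""
    diffs = [abs(a - b) for a, b in zip(vec1, vec2)]
    if p == 0:
        return sum(1 for d in diffs if d != 0)
    return sum(d ** p for d in diffs)


# T: root r with children a and 3; a has the leaves 1 and 2 as children.
T = {"r": None, "a": "r", 1: "a", 2: "a", 3: "r"}
# T_contracted: the pendant arc (a, 1) is contracted, so taxon 1 now labels
# the former node a and becomes a nested taxon with child 2.
T_contracted = {"r": None, 1: "r", 2: 1, 3: "r"}

v, w = cophenetic_vector(T, [1, 2, 3]), cophenetic_vector(T_contracted, [1, 2, 3])
print([D(v, w, p) for p in (0, 1, 2)])
```

Only the depth of taxon 1 changes (from 2 to 1), so the two cophenetic vectors differ by 1 in a single entry.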

So, let $T, T' \in \mathcal{UT}_n$ be such that $\varphi(T) = \varphi(T') + m \cdot e_{i,j}$ for some $m \geq 1$ and $1 \leq i \leq j \leq n$ (where $e_{i,j}$ stands for the vector of length $n(n+1)/2$ with all entries 0 except a 1 in the entry corresponding to the pair $(i,j)$); that is, $T$ and $T'$ are such that $\varphi_T(i,j) = \varphi_{T'}(i,j) + m$, for some $m \geq 1$, and $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every $(x,y) \neq (i,j)$. Let us prove first of all that $m = 1$. So, assume that $m \geq 2$ and let us reach a contradiction.

Since $\varphi_T(i,j) > 0$, there exists some taxon $k \neq i,j$ that is a descendant in $T$ of the parent of $[i,j]_T$; in other words, such that $[i,k]_T = [j,k]_T$ is the parent of $[i,j]_T$. But then

$$\varphi_{T'}(i,k) = \varphi_T(i,k) = \varphi_T(i,j) - 1 = \varphi_{T'}(i,j) + (m-1) > \varphi_{T'}(i,j)$$

$$\varphi_{T'}(j,k) = \varphi_T(j,k) = \varphi_T(i,j) - 1 = \varphi_{T'}(i,j) + (m-1) > \varphi_{T'}(i,j)$$

which cannot hold simultaneously: if $\varphi_{T'}(i,k) > \varphi_{T'}(i,j)$, then $\varphi_{T'}(j,k) = \varphi_{T'}(i,j)$. This shows that $m = 1$, and thus $\varphi(T) = \varphi(T') + e_{i,j}$.

Let us prove now that it cannot happen that $i \neq j$. Indeed, assume that $i \neq j$. If $\varphi_{T'}(i,j) = \delta_{T'}(i)$, then

$$\varphi_T(i,j) = \varphi_{T'}(i,j) + 1 = \delta_{T'}(i) + 1 = \delta_T(i) + 1,$$

which is impossible. This implies that $\varphi_{T'}(i,j) < \delta_{T'}(i), \delta_{T'}(j)$. If, now, $\varphi_{T'}(i,j) < \delta_{T'}(i) - 1$, then there will exist some leaf $k$ such that $[i,k]_{T'}$ is the child of $[i,j]_{T'}$ in the path from $[i,j]_{T'}$ to $i$. Then $\varphi_{T'}(i,k) = \varphi_{T'}(i,j) + 1$ and $\varphi_{T'}(j,k) = \varphi_{T'}(i,j)$, which entail that

$$\varphi_T(i,k) = \varphi_{T'}(i,k) = \varphi_{T'}(i,j) + 1 = \varphi_T(i,j) > \varphi_{T'}(i,j) = \varphi_{T'}(j,k) = \varphi_T(j,k),$$

which is also impossible. So, if $i \neq j$, the only possibility is that $\varphi_{T'}(i,j) = \delta_{T'}(i) - 1 = \delta_{T'}(j) - 1$, but then it would imply that $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1 = \delta_T(i) = \delta_T(j)$ and hence that $[i,j]_T = i = j$, which is again impossible.

So, if $\varphi(T) = \varphi(T') + e_{i,j}$ then it must happen that $i = j$. In this case, moreover, $i$ must be a leaf in $T$ with unlabeled parent. Indeed, if $i$ is not a leaf, then there is some leaf $k$ such that $i = [i,k]_T$ and hence $\delta_T(i) = \varphi_T(i,k)$. Then, $\delta_{T'}(i) = \delta_T(i) - 1 = \varphi_T(i,k) - 1 = \varphi_{T'}(i,k) - 1$, which is impossible. So, $i$ is a leaf in $T$. And if its parent is labeled, say with $l$, then $\delta_T(i) = \delta_T(l) + 1$ and $\delta_T(l) = \varphi_T(i,l)$. Thus, in $T'$, $\delta_{T'}(i) = \delta_T(i) - 1 = \delta_T(l) = \delta_{T'}(l)$ and $\delta_{T'}(i) = \delta_T(l) = \varphi_T(i,l) = \varphi_{T'}(i,l)$, which is also impossible, since it would imply that $[i,l]_{T'} = i = l$.

So, finally, it must happen that $i$ is a leaf in $T$ and its parent is not labeled. Let $T_0$ be the phylogenetic tree obtained from $T$ by contracting the pendant arc ending in $i$. Then $\varphi(T_0) = \varphi(T) - e_{i,i} = \varphi(T')$, and this implies, by Theorem 1, that $T_0 = T'$.

This finishes the proof that the only pairs $T, T' \in \mathcal{UT}_n$ such that $D_0(T,T') = 1$ are those where one of them is obtained from the other by the contraction of a pendant arc. Since these pairs of trees also satisfy that $D_p(T,T') = 1$ for every $p \geq 1$, this completes the proof of the proposition. □
Proof of Proposition 2

To ease the task of the reader, we split this proof into several lemmas. To begin with, notice that there are pairs of trees $T, T' \in \mathcal{T}_n$ such that $D_p(T,T') = 3$ for every $p \in \{0\} \cup [1,\infty[$: for instance, by Example 2, when $T'$ is obtained from $T$ by contracting an arc ending in the root of a cherry. So, the minimum non-zero value of $D_p(T,T')$ on $\mathcal{T}_n$ is at most 3.
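The cherry contraction mentioned here can be checked on the smallest possible case. As before, this is a hedged sketch: the child-to-parent dictionaries and the helper names are hypothetical, and $D_p$ is assumed to be the Hamming count for $p = 0$ and the sum of $p$-th powers of absolute differences for $p \geq 1$, which is consistent with the value 3 for every $p$ claimed above.

```python
from itertools import combinations_with_replacement


def path_to_root(parent, node):
    """Nodes on the path from node up to the root."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def cophenetic_vector(parent, taxa):
    """Depths of the lowest common ancestors of all pairs i <= j of taxa."""
    vec = []
    for i, j in combinations_with_replacement(sorted(taxa), 2):
        up_j = set(path_to_root(parent, j))
        lca = next(v for v in path_to_root(parent, i) if v in up_j)
        vec.append(len(path_to_root(parent, lca)) - 1)
    return vec


def D(vec1, vec2, p):
    """Assumed definition: Hamming count for p = 0, sum of p-th powers otherwise."""
    diffs = [abs(a - b) for a, b in zip(vec1, vec2)]
    return sum(1 for d in diffs if d) if p == 0 else sum(d ** p for d in diffs)


# T = ((1,2),3): the cherry {1, 2} hangs from the internal node a.
T = {"r": None, "a": "r", 1: "a", 2: "a", 3: "r"}
# T_star: the arc (r, a) ending in the root of the cherry is contracted.
T_star = {"r": None, 1: "r", 2: "r", 3: "r"}

v, w = cophenetic_vector(T, [1, 2, 3]), cophenetic_vector(T_star, [1, 2, 3])
print([D(v, w, p) for p in (0, 1, 2)])
```

The contraction lowers by 1 the cophenetic value of the pair (1,2) and the depths of 1 and 2, and leaves every other entry unchanged.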

Lemma 2. If $T, T' \in \mathcal{T}_n$ are such that $D_0(T,T') > 0$, then there exists a pair of different taxa $i \neq j$ such that $\varphi_T(i,j) \neq \varphi_{T'}(i,j)$.

Proof. If $\varphi_T(i,j) = \varphi_{T'}(i,j)$ for every $i \neq j$, then, by Corollary 1, $T = T'$ and therefore $D_0(T,T') = 0$. □

So, every pair of phylogenetic trees in $\mathcal{T}_n$ at non-zero $D_0$ distance must have a pair of different leaves with different cophenetic values.

Lemma 3. Let $T, T' \in \mathcal{T}_n$ be such that $\varphi_T(i,j) = \varphi_{T'}(i,j) + m$, for some $1 \leq i < j \leq n$ and some $m \geq 1$. Let $k \neq i,j$ be a leaf such that there exists a path from $[i,j]_{T'}$ to $[i,k]_{T'}$ of length $l$, for some $l \geq 1$. Then:

(a) If $\varphi_T(i,k) = \varphi_{T'}(i,k)$, then $\varphi_T(j,k) \geq \varphi_{T'}(j,k) + \min\{m,l\}$.

(b) If $\varphi_T(j,k) = \varphi_{T'}(j,k)$, then $\varphi_T(i,k) = \varphi_{T'}(i,k) - l$.

Proof. From the assumptions we have that $\varphi_{T'}(i,k) = \varphi_{T'}(i,j) + l = \varphi_{T'}(j,k) + l$. Now:

(a) Assume that $\varphi_T(i,k) = \varphi_{T'}(i,k)$. Then,

$$\varphi_T(i,k) = \varphi_{T'}(i,k) = \varphi_{T'}(i,j) + l = \varphi_T(i,j) - (m-l),$$

and then

• If $m > l$, then $\varphi_T(i,k) < \varphi_T(i,j)$, that is, $[i,j]_T \prec [i,k]_T$, and thus

$$\varphi_T(j,k) = \varphi_T(i,k) = \varphi_{T'}(i,k) = \varphi_{T'}(j,k) + l$$

• If $m = l$, then $\varphi_T(i,k) = \varphi_T(i,j)$, that is, $[i,k]_T = [i,j]_T$, and thus

$$\varphi_T(j,k) \geq \varphi_T(i,j) = \varphi_{T'}(i,j) + m = \varphi_{T'}(j,k) + m$$

• If $m < l$, then $\varphi_T(i,k) > \varphi_T(i,j)$, that is, $[i,k]_T \prec [i,j]_T$, and thus

$$\varphi_T(j,k) = \varphi_T(i,j) = \varphi_{T'}(i,j) + m = \varphi_{T'}(j,k) + m$$

(b) Assume that $\varphi_T(j,k) = \varphi_{T'}(j,k)$. Then

$$\varphi_T(j,k) = \varphi_{T'}(j,k) = \varphi_{T'}(i,j) = \varphi_T(i,j) - m,$$

so that $[i,j]_T \prec [j,k]_T$, and thus

$$\varphi_T(i,k) = \varphi_T(j,k) = \varphi_{T'}(j,k) = \varphi_{T'}(i,j) = \varphi_{T'}(i,k) - l$$

□

As a direct consequence of this lemma we obtain the following result. 

Corollary 2. Let $T, T' \in \mathcal{T}_n$ be such that $\varphi_T(i,j) = \varphi_{T'}(i,j) + m$, for some $1 \leq i < j \leq n$ and some $m \geq 1$. Let $N$ be the number of leaves $k$ such that $k \neq i,j$ and either $[i,k]_{T'} \prec [i,j]_{T'}$ or $[j,k]_{T'} \prec [i,j]_{T'}$. Then,

$$D_0(T,T') \geq N + 1.$$

□

Lemma 4. Let $T, T' \in \mathcal{T}_n$ be such that $D_0(T,T') \leq 3$. If $\varphi_T(i,j) = \varphi_{T'}(i,j) + m$, for some $1 \leq i < j \leq n$ and some $m \geq 1$, then $m = 1$.

Proof. If $\delta_{T'}(i) = \delta_T(i)$, then $\delta_{T'}(i) = \delta_T(i) > \varphi_T(i,j) = \varphi_{T'}(i,j) + m$, which implies that there are at least $m$ leaves $k$ such that $[i,k]_{T'} \prec [i,j]_{T'}$. Then, by the last corollary, $D_0(T,T') \geq m + 1$. Now, if $\delta_{T'}(j) = \delta_T(j)$, then for the same reason there are at least $m$ leaves $k$ such that $[j,k]_{T'} \prec [i,j]_{T'}$ and they increase $D_0(T,T')$ to at least $2m + 1$, while if $\delta_{T'}(j) \neq \delta_T(j)$, then $D_0(T,T') \geq m + 2$. We conclude then that if $\delta_{T'}(i) = \delta_T(i)$, then $m = 1$. By symmetry, if $\delta_{T'}(j) = \delta_T(j)$, then $m = 1$, too.

Finally, if $\delta_{T'}(i) \neq \delta_T(i)$ and $\delta_{T'}(j) \neq \delta_T(j)$, and since $\varphi_T(i,j) \neq \varphi_{T'}(i,j)$, we have that $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every $(x,y) \neq (i,i), (j,j), (i,j)$. Let now $k \neq i,j$ be a taxon such that $[i,k]_T = [j,k]_T$ is the parent of $[i,j]_T$ in $T$. Then

$$\varphi_{T'}(i,k) = \varphi_T(i,k) = \varphi_T(i,j) - 1 = \varphi_{T'}(i,j) + (m-1)$$

and therefore, if $m \geq 2$, $\varphi_{T'}(i,k) > \varphi_{T'}(i,j)$ and then, by Lemma 3, either $\varphi_T(i,k) \neq \varphi_{T'}(i,k)$ or $\varphi_T(j,k) \neq \varphi_{T'}(j,k)$, which, as we have seen, is impossible. Thus, $m = 1$ in all cases. □

Lemma 5. Let $T, T' \in \mathcal{T}_n$ be such that $D_0(T,T') \leq 3$. If $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$, for some $1 \leq i < j \leq n$, then $(\delta_{T'}(i) - \varphi_{T'}(i,j)) + (\delta_{T'}(j) - \varphi_{T'}(i,j)) \leq 3$.

Proof. Let us assume that $(\delta_{T'}(i) - \varphi_{T'}(i,j)) + (\delta_{T'}(j) - \varphi_{T'}(i,j)) \geq 4$ and let us reach a contradiction.

Assume first that $\delta_{T'}(i) \geq \varphi_{T'}(i,j) + 3$. Then, there are at least two leaves $k_1, k_2$ such that $[i,k_1]_{T'}, [i,k_2]_{T'} \prec [i,j]_{T'}$. Since each such leaf contributes at least 1 to $D_0(T,T') \leq 3$, we conclude that there must be exactly two such leaves and, moreover, $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every $(x,y) \neq (i,j), (i,k_1), (j,k_1), (i,k_2), (j,k_2)$. But then, on the one hand, $\delta_T(j) = \delta_{T'}(j)$ and, on the other hand, $\delta_{T'}(j) = \varphi_{T'}(i,j) + 1$ (otherwise, there would be some other leaf $k$ such that $[j,k]_{T'} \prec [i,j]_{T'}$, which, by Lemma 3, would satisfy that $\varphi_T(i,k) \neq \varphi_{T'}(i,k)$ or $\varphi_T(j,k) \neq \varphi_{T'}(j,k)$). Combining these two equalities we obtain $\delta_T(j) = \varphi_T(i,j)$, which is impossible in a tree without nested taxa. This proves that $\delta_{T'}(i) \leq \varphi_{T'}(i,j) + 2$ and, by symmetry, that $\delta_{T'}(j) \leq \varphi_{T'}(i,j) + 2$, as we claimed.

Thus, it remains to prove that the case $\delta_{T'}(i) = \delta_{T'}(j) = \varphi_{T'}(i,j) + 2$ is impossible. So, assume this case holds, and let us reach a contradiction. By Corollary 2, if $D_0(T,T') \leq 3$ and $\delta_{T'}(i) = \delta_{T'}(j) = \varphi_{T'}(i,j) + 2$, then there can exist only one extra leaf $k$ pending from the parent of $i$ and one extra leaf $l$ pending from the parent of $j$: see Fig. 12, where the grey triangle stands for the (possibly empty) subtree consisting of all other descendants of $[i,j]_{T'}$. Moreover, since $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$ and since both $k$ and $l$ contribute at least 1 to $D_0(T,T') \leq 3$, we conclude that $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every $(x,y) \neq (i,j), (i,k), (j,k), (i,l), (j,l)$. In particular

$$\varphi_T(k,l) = \varphi_{T'}(k,l) = \varphi_{T'}(i,j) = \varphi_T(i,j) - 1$$

$$\delta_T(i) = \delta_{T'}(i) = \varphi_{T'}(i,j) + 2 = \varphi_T(i,j) + 1$$

$$\delta_T(j) = \delta_T(k) = \delta_T(l) = \varphi_T(i,j) + 1 \quad \text{for the same reason}$$

Figure 12: The subtree of $T'$ rooted at $[i,j]_{T'}$ in the proof of Lemma 5.

Now we shall prove that, in this situation, each one of $k, l$ contributes actually at least 2 to $D_0(T,T')$, and therefore $D_0(T,T') \geq 5$, which contradicts the assumption that $D_0(T,T') \leq 3$.

(1) Assume that $\varphi_T(i,k) = \varphi_{T'}(i,k)$. Then, by Lemmas 3(a) and 4, $\varphi_T(j,k) = \varphi_{T'}(j,k) + 1$, and hence

$$\varphi_T(i,k) = \varphi_{T'}(i,k) = \varphi_{T'}(j,k) + 1 = \varphi_T(j,k)$$

$$\varphi_T(i,k) = \varphi_{T'}(i,k) = \varphi_{T'}(i,j) + 1 = \varphi_T(i,j)$$

$$\delta_T(i) = \delta_T(j) = \delta_T(k) = \delta_T(l) = \varphi_T(i,j) + 1$$

Thus, the subtree of $T$ rooted at $[k,l]_T$ contains a subtree of the form described in Fig. 13 for at least one leaf $h$. But then



which is impossible, since it would imply that $h$ is another descendant of $[l,i]_{T'}$. Therefore, $\varphi_T(i,k) \neq \varphi_{T'}(i,k)$ and, by symmetry, $\varphi_T(j,l) \neq \varphi_{T'}(j,l)$.

Figure 13: A subtree of the subtree of $T$ rooted at $[k,l]_T$ in case (1) in the proof of Lemma 5.



(2) Assume now that $\varphi_T(i,l) = \varphi_{T'}(i,l)$. Then, by Lemma 3(b), $\varphi_T(j,l) = \varphi_{T'}(j,l) - 1$, and then

$$\varphi_T(i,l) = \varphi_{T'}(i,l) = \varphi_{T'}(i,j) = \varphi_T(i,j) - 1$$

$$\varphi_T(j,l) = \varphi_{T'}(j,l) - 1 = \varphi_{T'}(i,j) = \varphi_T(i,j) - 1$$

$$\varphi_T(k,l) = \varphi_T(i,j) - 1$$

Therefore, the subtree of $T$ rooted at $[k,l]_T$ contains a subtree of the form described in Fig. 14 for at
least one leaf h. Moreover, k because (pTihJ) > frUJ) — frikj)- But then, again, 

which is again impossible by the same reason as in (1). Therefore, $\varphi_T(i,l) \neq \varphi_{T'}(i,l)$ and, by symmetry, $\varphi_T(j,k) \neq \varphi_{T'}(j,k)$.

Figure 14: A subtree of the subtree of $T$ rooted at $[k,l]_T$ in case (2) in the proof of Lemma 5.

So, each one of $k$ and $l$ contributes at least 2 to $D_0(T,T')$, and thus $D_0(T,T') \geq 5$. □

Summarizing the last lemmas, we have proved so far that if $D_0(T,T') \leq 3$ and $\varphi_T(i,j) \neq \varphi_{T'}(i,j)$, then, up to interchanging $T$ and $T'$, $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$ and either $i$ and $j$ are siblings in $T'$ or one of these leaves is a sibling of the parent of the other one in $T'$. The next two lemmas cover these two remaining cases.

Lemma 6. Let $T, T' \in \mathcal{T}_n$ be such that $D_0(T,T') \leq 3$, and assume that $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$, for some $1 \leq i < j \leq n$. If $i$ and $j$ are siblings in $T'$, then they are also siblings in $T$, they have no other sibling in $T$, and $T'$ is obtained from $T$ by contracting the arc ending in $[i,j]_T$. And then, $D_0(T,T') = 3$.

Proof. If $\delta_{T'}(i) = \delta_{T'}(j) = \varphi_{T'}(i,j) + 1$, then it must happen that $\delta_T(i) = \delta_{T'}(i) + 1$ and $\delta_T(j) = \delta_{T'}(j) + 1$. Indeed, if $\delta_T(i) \leq \delta_{T'}(i)$, then $\delta_T(i) \leq \varphi_{T'}(i,j) + 1 = \varphi_T(i,j)$, which is impossible. Therefore, $\delta_T(i) > \delta_{T'}(i)$ and by symmetry $\delta_T(j) > \delta_{T'}(j)$. Since $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$, $D_0(T,T') \leq 3$ implies that $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every $(x,y) \neq (i,j), (i,i), (j,j)$. Now, if, say, $\delta_T(i) \geq \delta_{T'}(i) + 2$, then

$$\delta_T(i) \geq \delta_{T'}(i) + 2 = \varphi_{T'}(i,j) + 3 = \varphi_T(i,j) + 2$$

and there would exist some leaf $k$ such that $[i,k]_T$ is a child of $[i,j]_T$. But then

$$\varphi_{T'}(i,k) = \varphi_T(i,k) = \varphi_T(i,j) + 1 = \varphi_{T'}(i,j) + 2 = \delta_{T'}(i) + 1,$$

which is impossible. This proves that $\delta_T(i) = \delta_{T'}(i) + 1$ and, by symmetry, $\delta_T(j) = \delta_{T'}(j) + 1$.

So, in summary, $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$, $\delta_T(i) = \delta_{T'}(i) + 1$, $\delta_T(j) = \delta_{T'}(j) + 1$ and $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every $(x,y) \neq (i,j), (i,i), (j,j)$, and in particular $D_p(T,T') = 3$.

Now, $\delta_T(i) = \delta_{T'}(i) + 1 = \varphi_{T'}(i,j) + 2 = \varphi_T(i,j) + 1$, and by symmetry, $\delta_T(j) = \varphi_T(i,j) + 1$, too. Therefore, $i$ and $j$ are siblings in $T$. Let us see that they have no other sibling in this tree. Indeed, if $k$ is a sibling of $i$ and $j$ in $T$, then

$$\varphi_{T'}(i,k) = \varphi_T(i,k) = \varphi_T(i,j) = \varphi_{T'}(i,j) + 1 = \delta_{T'}(i)$$

which is impossible.

Let $x$ be the parent of $[i,j]_T$, and assume that the subtree $T_0$ of $T$ rooted at $x$ is as described in Fig. 15 (a), for some (possibly empty) subtree $\widehat{T}$. Moreover, let $T'_0$ be the subtree of $T'$ rooted at $[i,j]_{T'}$, which is as described in Fig. 15 (b) for some subtree $\widehat{T}'$. We shall prove that $\widehat{T} = \widehat{T}'$.

Figure 15: (a) The subtree $T_0$ of $T$ rooted at the parent of $[i,j]_T$ in the proof of Lemma 6. (b) The subtree $T'_0$ of $T'$ rooted at $[i,j]_{T'}$ in the proof of the same lemma.

For every $k \in L(\widehat{T})$,

$$\varphi_{T'}(i,k) = \varphi_T(i,k) = \varphi_T(i,j) - 1 = \varphi_{T'}(i,j),$$

which entails that $k \in L(\widehat{T}')$. Conversely, if $k \in L(\widehat{T}')$, then

$$\varphi_T(i,k) = \varphi_{T'}(i,k) = \varphi_{T'}(i,j) = \varphi_T(i,j) - 1,$$

which entails that $k \in L(\widehat{T})$. Thus, $L(\widehat{T}) = L(\widehat{T}')$. And finally, for every (not necessarily different) $k, l \in L(\widehat{T})$,

$$\varphi_{\widehat{T}}(k,l) = \varphi_T(k,l) - \delta_T(x) = \varphi_T(k,l) - \varphi_T(i,j) + 1 = \varphi_{T'}(k,l) - \varphi_{T'}(i,j) = \varphi_{\widehat{T}'}(k,l),$$

which implies by Theorem 1 that $\widehat{T} = \widehat{T}'$ (notice that $\widehat{T}$ and $\widehat{T}'$ can have elementary roots).

Finally, let us prove now that $T$ and $T'$ are exactly the same except for $T_0$ and $T'_0$. More specifically, let $T_1$ and $T'_1$ be obtained by replacing in $T$ and $T'$ the subtrees $T_0$ and $T'_0$ by a single leaf $x$. Since for every $p, q \notin L(T_0) = L(T'_0)$,

$$\varphi_{T'_1}(p,q) = \varphi_{T'}(p,q) = \varphi_T(p,q) = \varphi_{T_1}(p,q),$$

$$\varphi_{T'_1}(x,p) = \varphi_{T'}(i,p) = \varphi_T(i,p) = \varphi_{T_1}(x,p),$$

we deduce, again by Theorem 1, that $T_1 = T'_1$.

This completes the proof that $T'$ is obtained from $T$ by replacing in it the subtree $T_0$ rooted at the parent $x$ of $[i,j]_T$ by the subtree $T'_0$ obtained from $T_0$ by contracting the arc $(x, [i,j]_T)$. □

Lemma 7. Let $T, T' \in \mathcal{T}_n$ be such that $D_0(T,T') \leq 3$. Assume that $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$, for some $1 \leq i < j \leq n$, and that $j$ is a sibling of the parent of $i$ in $T'$. Then, the subtree of $T'$ rooted at $[i,j]_{T'}$ is the tree $T'_0$ depicted in Fig. 16 (a), for some taxon $k \neq i,j$ and some (possibly empty) subtree $\widehat{T}'$, and $T$ is obtained from $T'$ by replacing $T'_0$ by the tree $T_0$ depicted in Fig. 16 (b). And then, $D_0(T,T') = 3$.

Figure 16: (a) The subtree $T'_0$ of $T'$ rooted at $[i,j]_{T'}$ in the statement of Lemma 7. (b) The subtree $T_0$ which replaces $T'_0$ in $T$ in the same statement.

Proof. We assume that $\delta_{T'}(i) = \varphi_{T'}(i,j) + 2$ and $\delta_{T'}(j) = \varphi_{T'}(i,j) + 1$. This implies that there exists at least one leaf $k$ such that $[i,k]_{T'} \prec [i,j]_{T'}$. Since $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$, $|\varphi_T(i,k) - \varphi_{T'}(i,k)| + |\varphi_T(j,k) - \varphi_{T'}(j,k)| \geq 1$ and $\delta_T(j) > \delta_{T'}(j)$ (because, otherwise, $\delta_T(j) \leq \delta_{T'}(j) = \varphi_{T'}(i,j) + 1 = \varphi_T(i,j)$, which is impossible), $D_0(T,T') \leq 3$ entails that $\varphi_T(i,k) = \varphi_{T'}(i,k)$ or $\varphi_T(j,k) = \varphi_{T'}(j,k)$, and that $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every $(x,y) \neq (i,j), (i,k), (j,k), (j,j)$ (and, in particular, $k$ is the only leaf different from $i$ such that $[i,k]_{T'} \prec [i,j]_{T'}$). Moreover, we have that $D_0(T,T') = 3$.

Let us see now that $\delta_T(j) = \delta_{T'}(j) + 1$. Indeed, if $\delta_T(j) \geq \delta_{T'}(j) + 2$, then

$$\delta_T(j) \geq \delta_{T'}(j) + 2 = \varphi_{T'}(i,j) + 3 = \varphi_T(i,j) + 2$$

and there would exist some leaf $l$ such that $[j,l]_T$ is a child of $[i,j]_T$. But then

$$\varphi_{T'}(j,l) = \varphi_T(j,l) = \varphi_T(i,j) + 1 = \varphi_{T'}(i,j) + 2 = \delta_{T'}(j) + 1$$

and we reach a contradiction.

So, in summary, the subtree $T'_0$ of $T'$ rooted at $[i,j]_{T'}$ is as described in Fig. 16 (a), and $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$, $\delta_T(j) = \delta_{T'}(j) + 1$, $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every $(x,y) \neq (i,j), (j,j), (i,k), (j,k)$, and either $\varphi_T(i,k) = \varphi_{T'}(i,k)$ or $\varphi_T(j,k) = \varphi_{T'}(j,k)$. Now, we discuss these two possibilities.

(a) If $\varphi_T(j,k) = \varphi_{T'}(j,k)$, then $\varphi_T(i,k) = \varphi_{T'}(i,k) - 1$ by Lemma 3(b). In this case

$$\varphi_T(i,k) = \varphi_{T'}(i,k) - 1 = \varphi_{T'}(i,j) = \varphi_T(i,j) - 1$$

$$\varphi_T(j,k) = \varphi_{T'}(j,k) = \varphi_{T'}(i,j) = \varphi_T(i,j) - 1$$

$$\delta_T(i) = \delta_{T'}(i) = \varphi_{T'}(i,j) + 2 = \varphi_T(i,j) + 1$$

$$\delta_T(j) = \delta_{T'}(j) + 1 = \varphi_{T'}(i,j) + 2 = \varphi_T(i,j) + 1$$

$$\delta_T(k) = \delta_{T'}(k) = \varphi_{T'}(i,j) + 2 = \varphi_T(i,j) + 1$$

This means that the subtree of $T$ rooted at $[i,k]_T = [j,k]_T$ contains a subtree of the form described in Fig. 17 for at least some new leaf $h$. But then

$$\varphi_{T'}(k,h) = \varphi_T(k,h) = \varphi_T(i,j) = \varphi_{T'}(i,j) + 1 = \varphi_{T'}(i,k)$$

which is impossible in $T'$, because $i$ and $k$ are the only descendants of $[i,k]_{T'}$ in $T'$. So, this case is impossible.

Figure 17: A subtree contained in the subtree of $T$ rooted at $[i,j]_T$ in case (a) in the proof of Lemma 7.

(b) If $\varphi_T(i,k) = \varphi_{T'}(i,k)$, then $\varphi_T(j,k) = \varphi_{T'}(j,k) + 1$ by Lemmas 3(a) and 4. In this case

$$\varphi_T(i,k) = \varphi_{T'}(i,k) = \varphi_{T'}(i,j) + 1 = \varphi_T(i,j)$$

$$\varphi_T(j,k) = \varphi_{T'}(j,k) + 1 = \varphi_{T'}(i,j) + 1 = \varphi_T(i,j)$$

$$\delta_T(i) = \delta_T(j) = \delta_T(k) = \varphi_T(i,j) + 1 \quad \text{as in (a)}$$

This implies that $i, j, k$ are siblings in $T$. If $l$ is any other sibling of them in $T$, then

which entails that $l$ is another descendant of $[i,k]_{T'}$ in $T'$, which is impossible. Therefore, the subtree $T_0$ of $T$ rooted at the parent of $[i,j]_T$ has the form depicted in Fig. 18 for some subtree $\widehat{T}$.

Figure 18: The subtree $T_0$ rooted at the parent of $[i,j]_T$ in case (b) in the proof of Lemma 7.

Finally, the same argument as in the last part of the proof of the last lemma shows that $\widehat{T} = \widehat{T}'$, and that if $T_1$ and $T'_1$ are obtained by replacing in $T$ and $T'$ the subtrees $T_0$ and $T'_0$ by a single leaf $x$, then $T_1 = T'_1$. We leave the details to the reader.

This completes the proof that $T$ and $T'$ are as described in the statement. □

We have proved so far that the minimum non-zero value of $D_0$ on $\mathcal{T}_n$ is 3, and we have characterized those pairs of trees $T, T' \in \mathcal{T}_n$ such that $D_0(T,T') = 3$. To extend this result to every $D_p$, $p \geq 1$, it is enough to check that every pair of trees in $\mathcal{T}_n$ such that $D_0(T,T') = 3$ also satisfies that $D_p(T,T') = 3$ for every $p \geq 1$, which is straightforward. This completes the proof of Proposition 2.
Proof of Proposition 3

As in Proposition 2, we also split this proof into several lemmas. First of all, notice that there are pairs of trees $T, T' \in \mathcal{BT}_n$ such that $D_p(T,T') = 4$ for every $p \in \{0\} \cup [1,\infty[$: see, for instance, Fig. 19. Therefore, the minimum non-zero value of $D_p$ on $\mathcal{BT}_n$ is at most 4.
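The smallest instance of such a pair is a pair of triplets in which a leaf is interchanged with its nephew, which is the move characterized later in this proof. The sketch below checks it under the same hedged assumptions as before: hypothetical child-to-parent dictionaries, and $D_p$ taken as the Hamming count for $p = 0$ and the sum of $p$-th powers of absolute differences for $p \geq 1$.

```python
from itertools import combinations_with_replacement


def path_to_root(parent, node):
    """Nodes on the path from node up to the root."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def cophenetic_vector(parent, taxa):
    """Depths of the lowest common ancestors of all pairs i <= j of taxa."""
    vec = []
    for i, j in combinations_with_replacement(sorted(taxa), 2):
        up_j = set(path_to_root(parent, j))
        lca = next(v for v in path_to_root(parent, i) if v in up_j)
        vec.append(len(path_to_root(parent, lca)) - 1)
    return vec


def D(vec1, vec2, p):
    """Assumed definition: Hamming count for p = 0, sum of p-th powers otherwise."""
    diffs = [abs(a - b) for a, b in zip(vec1, vec2)]
    return sum(1 for d in diffs if d) if p == 0 else sum(d ** p for d in diffs)


# T = ((1,2),3) and T_swapped = ((1,3),2): leaf 2 and its nephew 3 are interchanged.
T = {"r": None, "a": "r", 1: "a", 2: "a", 3: "r"}
T_swapped = {"r": None, "a": "r", 1: "a", 3: "a", 2: "r"}

v, w = cophenetic_vector(T, [1, 2, 3]), cophenetic_vector(T_swapped, [1, 2, 3])
print([D(v, w, p) for p in (0, 1, 2)])
```

Four entries change by 1: the cophenetic values of the pairs (1,2) and (1,3) and the depths of 2 and 3.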




Figure 19: A pair of binary trees such that $D_p(T,T') = 4$. The grey triangles represent the same tree.

Notice also that Lemma 2 also applies in $\mathcal{BT}_n$, and therefore, if $T, T' \in \mathcal{BT}_n$ are such that $D_0(T,T') > 0$, then there exist two taxa $i \neq j$ such that $\varphi_T(i,j) \neq \varphi_{T'}(i,j)$. And, of course, Lemma 3 also applies in $\mathcal{BT}_n$.

Lemma 8. Let $T, T' \in \mathcal{BT}_n$ be such that $D_0(T,T') \leq 4$. If $\varphi_T(i,j) = \varphi_{T'}(i,j) + m$, for some $1 \leq i < j \leq n$ and some $m \geq 1$, then $m = 1$.

Proof. Assume that $\varphi_T(i,j) = \varphi_{T'}(i,j) + m$ with $m \geq 2$, and let us reach a contradiction.

If $\delta_{T'}(i) = \delta_T(i)$, then $\delta_{T'}(i) > \varphi_T(i,j) = \varphi_{T'}(i,j) + m$, and therefore there exist leaves $x_1, \ldots, x_m$ such that $\varphi_{T'}(i,x_l) = \varphi_{T'}(i,j) + l$, for $l = 1, \ldots, m$. By Lemma 3, each such leaf $x_l$ adds at least 1 to $D_0(T,T')$. Therefore $D_0(T,T') \geq 1 + m$. Now, if moreover $\delta_{T'}(j) = \delta_T(j)$, then there also exist leaves $y_1, \ldots, y_m$ such that $\varphi_{T'}(j,y_l) = \varphi_{T'}(i,j) + l$, for $l = 1, \ldots, m$, and each such leaf $y_l$ also adds at least 1 to $D_0(T,T')$, which entails $D_0(T,T') \geq 1 + 2m \geq 5$. So, if $D_0(T,T') \leq 4$, it must happen that $\delta_{T'}(i) \neq \delta_T(i)$ or $\delta_{T'}(j) \neq \delta_T(j)$ (or both). Let us assume that $\delta_{T'}(j) \neq \delta_T(j)$.

Now, $\varphi_T(i,j) = \varphi_{T'}(i,j) + m \geq m$, and therefore there exist leaves $z_1, \ldots, z_m$ such that $\varphi_T(i,z_l) = \varphi_T(j,z_l) = \varphi_T(i,j) - l$, for $l = 1, \ldots, m$. If $\varphi_T(i,z_l) = \varphi_{T'}(i,z_l)$, then

$$\varphi_{T'}(i,z_l) = \varphi_T(i,z_l) = \varphi_T(i,j) - l = \varphi_{T'}(i,j) + (m-l) \geq \varphi_{T'}(i,j)$$

and therefore, by Lemma 3, $\varphi_{T'}(j,z_l) \neq \varphi_T(j,z_l)$, and thus, each such leaf $z_l$ adds at least 1 to $D_0(T,T')$, which entails $D_0(T,T') \geq 2 + m$. Therefore, if $D_0(T,T') \leq 4$ and $m \geq 2$, it must happen that $m = 2$ and, moreover, $\varphi_T(a,b) = \varphi_{T'}(a,b)$ for every $(a,b) \neq (i,j), (j,j), (i,z_1), (i,z_2), (j,z_1), (j,z_2)$.

In particular, $\delta_T(i) = \delta_{T'}(i)$, which as we have seen implies that there are at least two leaves $x_1, x_2$ such that $i \prec [i,x_2]_{T'} \prec [i,x_1]_{T'} \prec [i,j]_{T'}$. Since

implies that (up to interchanging $z_1$ and $z_2$) $i \prec [i,z_1]_{T'} \prec [i,j]_{T'}$ and $j \prec [j,z_2]_{T'} \prec [i,j]_{T'}$, we conclude that $\{x_1, x_2, z_1, z_2\}$ are at least 3 different leaves and hence they contribute at least 3 to $D_0(T,T')$, making $D_0(T,T') \geq 5$. □

Lemma 9. Let $T, T' \in \mathcal{BT}_n$ be such that $D_0(T,T') \leq 4$. If $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$, for some $1 \leq i < j \leq n$, then $\delta_{T'}(i), \delta_{T'}(j) \leq \varphi_{T'}(i,j) + 2$.

Proof. Let us assume that $\delta_{T'}(i) \geq \varphi_{T'}(i,j) + 3$, and let us reach a contradiction. The case when $\delta_{T'}(j) \geq \varphi_{T'}(i,j) + 3$ is symmetrical.

Since $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1 > 0$, there exists some taxon $k_0$ such that $[i,k_0]_T$ is the parent of $[i,j]_T$. Let us distinguish several cases.

(a) Assume that $\varphi_T(i,k_0) = \varphi_{T'}(i,k_0)$. Then, $\varphi_{T'}(i,k_0) = \varphi_T(i,k_0) = \varphi_T(i,j) - 1 = \varphi_{T'}(i,j)$ implies that $[j,k_0]_{T'} \prec [i,j]_{T'}$ and thus $\varphi_{T'}(j,k_0) > \varphi_{T'}(i,j) = \varphi_T(i,j) - 1 = \varphi_T(j,k_0)$ and in particular, by the previous lemma, $\varphi_{T'}(j,k_0) = \varphi_T(j,k_0) + 1 = \varphi_T(i,j) = \varphi_{T'}(i,j) + 1$. Now, since $D_0(T,T') \leq 4$, by Lemma 4 the number of leaves $a \neq i, j, k_0$ such that $a \prec [i,j]_{T'}$ is at most 2.

If $\delta_{T'}(i) \geq \varphi_{T'}(i,j) + 3$, then there exist leaves $k_1, k_2$ such that $\varphi_{T'}(i,k_1) = \varphi_{T'}(i,j) + 1$ and $\varphi_{T'}(i,k_2) = \varphi_{T'}(i,j) + 2$ and then $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every $(x,y) \neq (i,j), (i,k_0), (j,k_0), (k_1,i), (k_1,j), (k_2,i), (k_2,j)$. In particular, no leaf other than $i, j, k_0, k_1, k_2$ descends from $[i,j]_{T'}$. But then

$$\varphi_T(k_1,k_0) = \varphi_{T'}(k_1,k_0) = \varphi_{T'}(i,j) = \varphi_T(i,j) - 1, \qquad \varphi_T(k_2,k_0) = \varphi_T(i,j) - 1$$

$$\varphi_T(k_1,k_2) = \varphi_{T'}(k_1,k_2) = \varphi_{T'}(i,j) + 1 = \varphi_T(i,j)$$

imply that, up to interchanging $k_1$ and $k_2$, $i \prec [i,k_1]_T \prec [i,j]_T$ and $j \prec [j,k_2]_T \prec [i,j]_T$, and then

$$\delta_{T'}(j) = \delta_T(j) > \varphi_T(i,j) + 1 = \varphi_{T'}(i,j) + 2$$

implies the existence of at least another leaf $h$ such that $j \prec [j,h]_{T'} \prec [j,k_0]_{T'} \prec [i,j]_{T'}$, which, as we have mentioned, is impossible. So, this case cannot happen.

(b) Assume now that $\varphi_T(j,k_0) = \varphi_{T'}(j,k_0)$. By symmetry with the previous case, this implies that $\varphi_{T'}(i,k_0) = \varphi_{T'}(i,j) + 1$, $\varphi_{T'}(i,k_0) = \varphi_T(i,k_0) + 1$ and that the number of leaves $a \neq i, j, k_0$ such that $a \prec [i,j]_{T'}$ is at most 2. Now we have three new subcases to discuss.

(b.1) If $\delta_{T'}(i) = \varphi_{T'}(i,j) + 4$, so that there exist leaves $k_1, k_2 \neq i$ such that $\varphi_{T'}(i,k_0), \varphi_{T'}(i,k_1), \varphi_{T'}(i,k_2) > \varphi_{T'}(i,j)$, and no leaf other than $k_0, k_1, k_2$ descends from $[i,j]_{T'}$, then $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every $(x,y) \neq (i,j), (i,k_0), (j,k_0), (k_1,i), (k_1,j), (k_2,i), (k_2,j)$. But in this case it must happen that $\delta_T(j) = \delta_{T'}(j) = \varphi_{T'}(i,j) + 1 = \varphi_T(i,j)$, which is impossible. So, this case cannot happen.

(b.2) If $\delta_{T'}(i) = \varphi_{T'}(i,j) + 3$ and $\delta_{T'}(j) = \varphi_{T'}(i,j) + 2$, so that there exist leaves $k_1, k_2$ such that $\varphi_{T'}(j,k_1) = \varphi_{T'}(i,j) + 1$, $\varphi_{T'}(i,k_2) = \varphi_{T'}(i,j) + 2$ and, recall, $\varphi_{T'}(i,k_0) = \varphi_{T'}(i,j) + 1$, then $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every $(x,y) \neq (i,j), (i,k_0), (j,k_0), (k_1,i), (k_1,j), (k_2,i), (k_2,j)$. But then

$$\varphi_T(k_1,k_0) = \varphi_{T'}(k_1,k_0) = \varphi_{T'}(i,j) = \varphi_T(i,j) - 1$$

implies that $k_1 \prec [i,j]_T$, and then

$$\delta_T(j) = \delta_{T'}(j) = \varphi_{T'}(i,j) + 2 = \varphi_T(i,j) + 1,$$

$$\delta_T(k_1) = \delta_{T'}(k_1) = \varphi_{T'}(i,j) + 2 = \varphi_T(i,j) + 1$$

imply that $j$ and $k_1$ are the only children of $[i,j]_T$, which is, of course, impossible. So, this case cannot happen, either.

(b.3) If $\delta_{T'}(i) = \varphi_{T'}(i,j) + 3$ and $\delta_{T'}(j) = \varphi_{T'}(i,j) + 1$, then on the one hand there exists a leaf $k_1$ such that $\varphi_{T'}(i,k_1) = \varphi_{T'}(i,k_0) + 1 = \varphi_{T'}(i,j) + 2$ and, on the other hand, as we have seen in (b.1), $\delta_T(j) > \delta_{T'}(j)$. Then, $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every $(x,y) \neq (i,j), (j,j), (i,k_0), (j,k_0), (k_1,i), (k_1,j)$, and in particular no leaf other than $i, j, k_0, k_1$ descends from $[i,j]_{T'}$.

Now,

$$\varphi_T(k_1,k_0) = \varphi_{T'}(k_1,k_0) = \varphi_{T'}(i,j) + 1 = \varphi_T(i,j)$$

implies that $k_1 \not\prec [i,j]_T$, and

$$\delta_T(i) = \delta_{T'}(i) = \varphi_{T'}(i,j) + 3 = \varphi_T(i,j) + 2$$

implies that there exists a leaf $h \neq k_0, k_1$ such that $i \prec [i,h]_T \prec [i,j]_T$ and hence

$$\varphi_{T'}(i,h) = \varphi_T(i,h) \geq \varphi_T(i,j) + 1 = \varphi_{T'}(i,j) + 2$$

would entail that $h \prec [i,j]_{T'}$, which is impossible. Thus, this case cannot happen, either.

(c) Assume finally that $\varphi_T(i,k_0) \neq \varphi_{T'}(i,k_0)$ and $\varphi_T(j,k_0) \neq \varphi_{T'}(j,k_0)$. The contribution to $D_0$ of the pairs $(i,j), (i,k_0), (j,k_0)$ is at least 3, and therefore there can only exist at most one other pair of leaves with different cophenetic value in $T$ and in $T'$. Since every $x \neq i,j$ such that $x \prec [i,j]_{T'}$ defines at least one such pair, we conclude that if $\delta_{T'}(i) \geq \varphi_{T'}(i,j) + 3$, then it must happen that $[i,k_0]_{T'} \prec [i,j]_{T'}$ and that there can only exist one leaf $k_1 \neq k_0, i$ such that $[i,k_1]_{T'} \prec [i,j]_{T'}$, and then, moreover, $[i,k_0]_{T'} \neq [i,k_1]_{T'}$. In this case, $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every $(x,y) \neq (i,j), (i,k_0), (j,k_0), (k_1,i), (k_1,j)$. But then, in particular, $\delta_{T'}(j) = \varphi_{T'}(i,j) + 1$ and $\delta_T(j) = \delta_{T'}(j)$, which implies $\delta_T(j) = \varphi_T(i,j)$, which is impossible.

This finishes the proof that, if $D_0(T,T') \leq 4$, then $\delta_{T'}(i) \leq \varphi_{T'}(i,j) + 2$ and $\delta_{T'}(j) \leq \varphi_{T'}(i,j) + 2$. □

Lemma 10. Let $T, T' \in \mathcal{BT}_n$ be such that $D_0(T,T') \leq 4$. If $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$, for some $1 \leq i < j \leq n$, then $i, j$ are siblings in $T$.

Proof. Let $k_0$ be any leaf such that $[i,k_0]_T = [j,k_0]_T$ is the parent of $[i,j]_T$ in $T$. If $\varphi_T(i,k_0) = \varphi_{T'}(i,k_0)$, then $\varphi_{T'}(i,k_0) = \varphi_T(i,k_0) = \varphi_T(i,j) - 1 = \varphi_{T'}(i,j)$ implies that $[j,k_0]_{T'} \prec [i,j]_{T'}$ and thus $\varphi_{T'}(j,k_0) > \varphi_{T'}(i,j) = \varphi_T(i,j) - 1 = \varphi_T(j,k_0)$. Therefore, $|\varphi_T(i,k_0) - \varphi_{T'}(i,k_0)| + |\varphi_T(j,k_0) - \varphi_{T'}(j,k_0)| \geq 1$.

Assume now that $i, j$ are not siblings in $T$, and let $h$ be a leaf such that $[i,h]_T$ is a child of $[i,j]_T$. If $\varphi_T(i,h) \leq \varphi_{T'}(i,h)$, then

$$\delta_{T'}(i) \geq \varphi_{T'}(i,h) + 1 \geq \varphi_T(i,h) + 1 = \varphi_T(i,j) + 2 = \varphi_{T'}(i,j) + 3$$

which is impossible by the previous lemma. Therefore, $\varphi_T(i,h) > \varphi_{T'}(i,h)$, and by Lemma 8, $\varphi_T(i,h) = \varphi_{T'}(i,h) + 1$.

In a similar way, if $\delta_T(i) = \delta_{T'}(i)$, then

$$\delta_{T'}(i) = \delta_T(i) \geq \varphi_T(i,h) + 1 = \varphi_T(i,j) + 2 = \varphi_{T'}(i,j) + 3$$

which is again impossible by the previous lemma. Therefore, $\delta_T(i) \neq \delta_{T'}(i)$, too. So, $(i,j)$, $(i,k_0)$, $(j,k_0)$, $(i,i)$, and $(i,h)$ contribute at least 4 to $D_0(T,T') \leq 4$, which implies that $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every other pair of leaves $(x,y)$. But then,

$$\varphi_{T'}(j,h) = \varphi_T(j,h) = \varphi_T(i,j) = \varphi_{T'}(i,j) + 1$$

which is impossible. Therefore, $i$ and $j$ are siblings in $T$. □

Lemma 11. Let $T, T' \in \mathcal{BT}_n$ be such that $D_0(T,T') \leq 4$. If $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$, for some $1 \leq i < j \leq n$, then $i, j$ are not siblings in $T'$.

Proof. Assume that $i, j$ are siblings in $T'$, and recall that we already know that they are siblings in $T$. Let $k_0$ be any leaf such that $[i,k_0]_T = [j,k_0]_T$ is the parent of $[i,j]_T$ in $T$. If $\varphi_T(i,k_0) = \varphi_{T'}(i,k_0)$, then

which is impossible if $i, j$ are siblings in $T'$. Thus, $\varphi_T(i,k_0) \neq \varphi_{T'}(i,k_0)$ and, by symmetry, $\varphi_T(j,k_0) \neq \varphi_{T'}(j,k_0)$. On the other hand, if $\delta_T(i) = \delta_{T'}(i)$, then



which is also impossible. Therefore, $\delta_T(i) \neq \delta_{T'}(i)$ and, by symmetry, $\delta_T(j) \neq \delta_{T'}(j)$. But, then, $D_0(T,T') \geq 5$. □

Summarizing what we know so far, we have proved that if $D_0(T,T') \leq 4$ and $\varphi_T(i,j) \neq \varphi_{T'}(i,j)$, then, up to interchanging $T$ and $T'$, $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$, $i, j$ are siblings in $T$, and then the subtree of $T'$ rooted at $[i,j]_{T'}$ is a triplet or a totally balanced quartet; cf. Fig. 20. The next two lemmas cover these two possibilities.

Figure 20: The only possibilities for the subtree of $T'$ rooted at $[i,j]_{T'}$ if $D_0(T,T') \leq 4$ and $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$.



Lemma 12. Let $T, T' \in \mathcal{BT}_n$ be such that $D_0(T,T') \leq 4$. If $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$, for some $1 \leq i < j \leq n$, and the subtree of $T'$ rooted at $[i,j]_{T'}$ is the triplet depicted in the left hand side of Fig. 20, then $T$ is obtained from $T'$ by interchanging $j$ and $k$: cf. Fig. 21. And then, $D_0(T,T') = 4$.

Figure 21: The only pairs of trees $T, T'$ such that $D_0(T,T') \leq 4$ and $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$, when the subtree of $T'$ rooted at $[i,j]_{T'}$ is a triplet.

Proof. Assume that the subtree of $T'$ rooted at $[i,j]_{T'}$ has the form depicted in the left hand side of Fig. 20 and that $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$. Then, since $i$ and $j$ are siblings in $T$,

$$\delta_T(j) = \varphi_T(i,j) + 1 = \varphi_{T'}(i,j) + 2 = \delta_{T'}(j) + 1.$$

Now, if $\varphi_T(i,k) \geq \varphi_{T'}(i,k)$, then

$$\varphi_T(i,k) \geq \varphi_{T'}(i,k) = \varphi_{T'}(i,j) + 1 = \varphi_T(i,j)$$

which is impossible, because $i$ and $j$ are siblings in $T$. Therefore, $\varphi_T(i,k) < \varphi_{T'}(i,k)$ and, by Lemma 8, $\varphi_T(i,k) = \varphi_{T'}(i,k) - 1$, and in particular $\varphi_T(i,k) = \varphi_T(j,k) = \varphi_T(i,j) - 1$. Therefore, $[i,k]_T$ is the parent of $[i,j]_T$ in $T$.

Finally, if $\delta_T(k) \geq \varphi_T(i,j) + 1$, then there exists at least some other leaf $l \prec [i,k]_T = [j,k]_T$. But then $\varphi_T(i,l) \neq \varphi_{T'}(i,l)$, because otherwise

which is impossible because the only leaves descending from $[i,j]_{T'}$ are $i, j, k$. And, by symmetry, $\varphi_T(j,l) \neq \varphi_{T'}(j,l)$, and we reach $D_0(T,T') \geq 5$. Therefore,

$$\delta_T(k) = \varphi_T(i,j) = \varphi_{T'}(i,j) + 1 = \delta_{T'}(k) - 1.$$

So, in summary, $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$, $\delta_T(j) = \delta_{T'}(j) + 1$, $\varphi_T(i,k) = \varphi_{T'}(i,k) - 1$, and $\delta_T(k) = \delta_{T'}(k) - 1$, and $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every $(x,y)$ other than $(i,j), (j,j), (i,k), (k,k)$. Moreover, in $T$, $k$ is the other child of the parent of $[i,j]_T$.

So, the subtree $T_0$ of $T$ rooted at the parent of $[i,j]_T$ is obtained by interchanging $j$ and $k$ in the subtree $T'_0$ of $T'$ rooted at $[i,j]_{T'}$. Finally, let us prove now that $T$ and $T'$ are exactly the same except for $T_0$ and $T'_0$. More specifically, let $T_1$ and $T'_1$ be obtained by replacing in $T$ and $T'$ the subtrees $T_0$ and $T'_0$ by a single leaf $x$. Since for every $p, q \notin \{i, j, k\}$,

$$\varphi_{T'_1}(p,q) = \varphi_{T'}(p,q) = \varphi_T(p,q) = \varphi_{T_1}(p,q),$$

$$\varphi_{T'_1}(x,p) = \varphi_{T'}(i,p) = \varphi_T(i,p) = \varphi_{T_1}(x,p),$$

we deduce, by Theorem 1, that $T_1 = T'_1$.

This completes the proof that $T$ is obtained from $T'$ by interchanging the leaf $j$ and its nephew $k$. □

Lemma 13. Let $T,T' \in \mathcal{BT}_n$ be such that $D_0(T,T') \leq 4$. If $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$ for some $1 \leq i < j \leq n$, and the subtree of $T'$ rooted at $[i,j]_{T'}$ is the quartet depicted in the right hand side of Fig. 20, then $T$ is obtained from $T'$ by interchanging $j$ and $k$; cf. Fig. 22. And then $D_0(T,T') = 4$.

Figure 22: The only pairs of trees $T,T'$ such that $D_0(T,T') \leq 4$ and $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$, when the subtree of $T'$ rooted at $[i,j]_{T'}$ is a quartet.

Proof. Assume that the subtree of $T'$ rooted at $[i,j]_{T'}$ has the form depicted in the right hand side of Fig. 20 and that $\varphi_T(i,j) = \varphi_{T'}(i,j) + 1$. If $\varphi_T(i,k) \geq \varphi_{T'}(i,k)$, then

$$\varphi_T(i,k) \geq \varphi_{T'}(i,k) = \varphi_{T'}(i,j) + 1 = \varphi_T(i,j),$$

which is impossible if $i,j$ are sibling in $T$. Therefore, $\varphi_T(i,k) < \varphi_{T'}(i,k)$ and, by Lemma [s], $\varphi_T(i,k) = \varphi_{T'}(i,k) - 1$, and in particular $\varphi_T(i,k) = \varphi_T(i,j) - 1$. By symmetry, $\varphi_T(j,l) = \varphi_{T'}(j,l) - 1$ and hence $\varphi_T(j,l) = \varphi_T(i,j) - 1$, too. Therefore, both $k$ and $l$ are descendants of the parent of $[i,j]_T$. But then,

$$\varphi_{T'}(k,l) = \varphi_{T'}(i,j) = \varphi_T(i,j) - 1 < \varphi_T(k,l)$$

and therefore, by Lemma [s], $\varphi_T(k,l) = \varphi_{T'}(k,l) + 1 = \varphi_T(i,j)$.

At this point, $D_0(T,T') \leq 4$ entails that $\varphi_T(x,y) = \varphi_{T'}(x,y)$ for every $(x,y)$ other than $(i,j)$, $(i,k)$, $(j,l)$, $(k,l)$. Moreover, $i,k,j,l$ are the only descendant leaves of the parent of $[i,j]_T$ in $T$. Indeed, if $h$ is another descendant leaf of the parent of $[i,j]_T$, then

$$\varphi_{T'}(i,h) = \varphi_T(i,h) \geq \varphi_T(i,j) - 1 = \varphi_{T'}(i,j)$$

and therefore $h$ would be another descendant of $[i,j]_{T'}$. And, as we have seen, the subtree $T_0$ of $T$ rooted at this node is obtained from the subtree $T_0'$ of $T'$ rooted at $[i,j]_{T'}$ by interchanging $j$ and $k$. Finally, arguing as in the last part of the proof of the previous lemma, we deduce that $T$ and $T'$ are exactly the same except for $T_0$ and $T_0'$. □

We have proved so far that the minimum value of $D_0$ on $\mathcal{BT}_n$ is 4, and we have characterized the pairs of trees $T,T' \in \mathcal{BT}_n$ such that $D_0(T,T') = 4$. To extend this result to every $p \geq 1$, it is enough to check that every pair of binary trees such that $D_0(T,T') = 4$ also satisfies that $D_p(T,T') = 4$ for every $p \geq 1$, which is straightforward. This completes the proof of Proposition 3.
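The two interchanges characterized in Lemmas 12 and 13 can be checked on a small instance. The following is a minimal sketch, not the authors' implementation: trees are encoded as nested tuples of integer leaf labels, the helper names are hypothetical, and reading $D_p$ (for $p \geq 1$) as the sum of the $p$-th powers of the entrywise absolute differences is an assumption made here.

```python
def cophenetic_vector(tree, root_depth=0):
    """Map each pair (i, j), i <= j, to its cophenetic value; (i, i) holds the depth of leaf i."""
    phi = {}

    def visit(node, depth):
        if not isinstance(node, tuple):          # a leaf, labeled by an int
            phi[(node, node)] = depth
            return [node]
        groups = [visit(child, depth + 1) for child in node]
        # Leaves lying in different child subtrees have this node as their LCA.
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                for x in groups[a]:
                    for y in groups[b]:
                        phi[(min(x, y), max(x, y))] = depth
        return [leaf for group in groups for leaf in group]

    visit(tree, root_depth)
    return phi


def D(t1, t2, p=0):
    """D_0 counts differing entries; for p >= 1 (assumed reading) it sums |difference|^p."""
    v1, v2 = cophenetic_vector(t1), cophenetic_vector(t2)
    diffs = [abs(v1[key] - v2[key]) for key in v1]
    if p == 0:
        return sum(1 for d in diffs if d != 0)
    return sum(d ** p for d in diffs)


# Triplet case (Lemma 12) with i = 1, j = 2, k = 3; leaves 4 and 5 lie outside.
T_prime = (((1, 3), 2), (4, 5))
T = (((1, 2), 3), (4, 5))

# Quartet case (Lemma 13) with i = 1, j = 2, k = 3, l = 4; leaf 5 lies outside.
Q_prime = (((1, 3), (2, 4)), 5)
Q = (((1, 2), (3, 4)), 5)

print(D(T, T_prime), D(Q, Q_prime))
```

In both cases the four differing entries change by exactly 1, which is why the value does not depend on $p$.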
Proof of Proposition 4

Let $\mathcal{X}_n$ denote any space $\mathcal{UT}_n$, $\mathcal{T}_n$ or $\mathcal{BT}_n$, and let $\Delta_p(\mathcal{X}_n)$, $p \in \{0\} \cup [1,\infty[$, denote the diameter of $d_{\varphi,p}$ on $\mathcal{X}_n$.




Figure 23: (a) The rooted star with $n$ leaves. (b) The only maximally balanced tree with 5 leaves, up to relabelings. (c) A rooted caterpillar with $n$ leaves.



We consider first the case $p = 1$, which will be used later to prove the case $p > 1$. For every $T \in \mathcal{UT}_n$, let

$$S(T) = \sum_{i=1}^{n} \delta_T(i), \qquad \Phi(T) = \sum_{1 \leq i < j \leq n} \varphi_T(i,j).$$

$S$ and $\Phi$ are the extensions to $\mathcal{UT}_n$ of the Sackin index [28] and the total cophenetic index [20] for phylogenetic trees without nested taxa, respectively. Notice that $\|\varphi(T)\|_1 = S(T) + \Phi(T)$. We have the following results on these indices:

• It is straightforward to check that the minimum values of $S(T)$ and $\Phi(T)$ on $\mathcal{T}_n$ are both reached at the rooted star tree with $n$ leaves (the phylogenetic tree with all its leaves of depth 1; see Fig. 23(a)), and these minimum values are, respectively,

$$\min S(\mathcal{T}_n) = n, \qquad \min \Phi(\mathcal{T}_n) = 0.$$



• It is also straightforward to check that the minimum values of $S(T)$ and $\Phi(T)$ on $\mathcal{UT}_n$ are both reached at the rooted star tree with $n-1$ leaves and with the root labeled with $n$, and these minimum values are, respectively,

$$\min S(\mathcal{UT}_n) = n - 1, \qquad \min \Phi(\mathcal{UT}_n) = 0.$$

• The minimum values of $S(T)$ and $\Phi(T)$ on $\mathcal{BT}_n$ are both reached at the maximally balanced trees with $n$ leaves (those binary trees such that, for every internal node, the numbers of descendant leaves of its two children differ at most in 1; see, for instance, Fig. 23(b)). And then, these minimum values are, respectively,

$$\min S(\mathcal{BT}_n) = n\lfloor \log_2(4n) \rfloor - 2^{\lfloor \log_2(2n) \rfloor},$$

$$\min \Phi(\mathcal{BT}_n) = \sum_{k=0}^{n-1} a(k),$$

where $a(k)$ is the highest power of 2 that divides $k!$. For the proofs, see [30] combined with [21] for $S$, and [20] for $\Phi$. From the first formula it is clear that $\min S(\mathcal{BT}_n)$ is in $\Theta(n\log(n))$. As far as $\min \Phi(\mathcal{BT}_n)$ goes, it is shown in [20] that it satisfies the recurrence

$$\min \Phi(\mathcal{BT}_n) = \min \Phi(\mathcal{BT}_{\lceil n/2 \rceil}) + \min \Phi(\mathcal{BT}_{\lfloor n/2 \rfloor}) + \binom{\lceil n/2 \rceil}{2} + \binom{\lfloor n/2 \rfloor}{2}, \quad \text{for } n \geq 3,$$

from where it is obvious that its order is in $\Theta(n^2)$.

• The maximum values of $S(T)$ and $\Phi(T)$ on both $\mathcal{T}_n$ and $\mathcal{BT}_n$ are reached at the rooted caterpillar trees with $n$ leaves (binary phylogenetic trees such that all their internal nodes have a leaf child; see Fig. 23(c)). And then, these maximum values are, respectively,

$$\max S(\mathcal{T}_n) = \max S(\mathcal{BT}_n) = \binom{n+1}{2} - 1, \qquad \max \Phi(\mathcal{T}_n) = \max \Phi(\mathcal{BT}_n) = \binom{n}{3},$$

which are thus in $\Theta(n^2)$ and $\Theta(n^3)$, respectively. For the proofs, see again [30] for $S$ and [20] for $\Phi$.



• Given any tree in $\mathcal{UT}_n$ with a nested taxon, if we replace this nested taxon by a new leaf labeled with it pending from the node previously labeled with it (cf. Fig. 24), we obtain a new tree in $\mathcal{UT}_n$ with strictly larger value of $S$ and the same value of $\Phi$. This shows that the maximum values of $S(T)$ and $\Phi(T)$ on $\mathcal{UT}_n$ are reached at trees in $\mathcal{T}_n$, and hence at the rooted caterpillar trees with $n$ leaves. Therefore, they are also in $\Theta(n^2)$ and $\Theta(n^3)$, respectively.



Figure 24: This operation increases the value of $S$ and does not modify the value of $\Phi$.
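The extremal values of $S$ and $\Phi$ listed above can be recomputed on the three tree shapes of Fig. 23. The sketch below is illustrative only: the nested-tuple encoding and all function names are assumptions introduced here, and $a(k)$ is evaluated as the exponent of the highest power of 2 dividing $k!$ through Legendre's formula.

```python
from math import comb


def cophenetic_vector(tree, root_depth=0):
    """Map each pair (i, j), i <= j, to its cophenetic value; (i, i) holds the depth of leaf i."""
    phi = {}

    def visit(node, depth):
        if not isinstance(node, tuple):
            phi[(node, node)] = depth
            return [node]
        groups = [visit(child, depth + 1) for child in node]
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                for x in groups[a]:
                    for y in groups[b]:
                        phi[(min(x, y), max(x, y))] = depth
        return [leaf for group in groups for leaf in group]

    visit(tree, root_depth)
    return phi


def sackin_and_cophenetic(tree):
    """Return (S(T), Phi(T)): the sum of leaf depths and the sum over pairs i < j."""
    phi = cophenetic_vector(tree)
    s = sum(v for (i, j), v in phi.items() if i == j)
    total = sum(v for (i, j), v in phi.items() if i < j)
    return s, total


def star(n):
    return tuple(range(1, n + 1))


def caterpillar(n):
    tree = (1, 2)
    for leaf in range(3, n + 1):
        tree = (tree, leaf)
    return tree


def balanced(leaves):
    """Maximally balanced binary tree: split the leaves as evenly as possible at every node."""
    if len(leaves) == 1:
        return leaves[0]
    mid = (len(leaves) + 1) // 2
    return (balanced(leaves[:mid]), balanced(leaves[mid:]))


def a(k):
    """Exponent of the highest power of 2 dividing k! (Legendre's formula)."""
    return k - bin(k).count("1")


def min_sackin_binary(n):
    # n * floor(log2(4n)) - 2 ** floor(log2(2n)), computed with integer bit lengths
    return n * ((4 * n).bit_length() - 1) - 2 ** ((2 * n).bit_length() - 1)


for n in (5, 8):
    print(n, sackin_and_cophenetic(star(n)),
          sackin_and_cophenetic(balanced(list(range(1, n + 1)))),
          sackin_and_cophenetic(caterpillar(n)))
```

Since $\|\varphi(T)\|_1 = S(T) + \Phi(T)$, adding the two components of each pair gives the $L^1$ norm of the corresponding cophenetic vector.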

From these properties we deduce the following result. 

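Lemma 14. The minimum value of $\|\varphi(T)\|_1$ on $\mathcal{UT}_n$ and $\mathcal{T}_n$ is in $\Theta(n)$. The minimum value of $\|\varphi(T)\|_1$ on $\mathcal{BT}_n$ is at most in $\Theta(n^2)$. The maximum value of $\|\varphi(T)\|_1$ on $\mathcal{UT}_n$, $\mathcal{T}_n$ and $\mathcal{BT}_n$ is in $\Theta(n^3)$. □

Now, we can apply this lemma to find the order of the diameter of $d_{\varphi,1}$ on the spaces $\mathcal{X}_n$ of unweighted phylogenetic trees.

Lemma 15. The diameter of $d_{\varphi,1}$ on $\mathcal{UT}_n$, $\mathcal{T}_n$ and $\mathcal{BT}_n$ is in $\Theta(n^3)$.

Proof. Let $T_1,T_2 \in \mathcal{X}_n$. Then, on the one hand,

$$d_{\varphi,1}(T_1,T_2) = \|\varphi(T_1) - \varphi(T_2)\|_1 \leq \|\varphi(T_1)\|_1 + \|\varphi(T_2)\|_1 \leq 2 \cdot \max \|\varphi(\mathcal{X}_n)\|_1 = \Theta(n^3),$$

which shows that $\Delta_1(\mathcal{X}_n) \leq O(n^3)$. On the other hand, if $\|\varphi(T_1)\|_1 \geq \|\varphi(T_2)\|_1$, then

$$d_{\varphi,1}(T_1,T_2) = \|\varphi(T_1) - \varphi(T_2)\|_1 \geq \|\varphi(T_1)\|_1 - \|\varphi(T_2)\|_1$$

and therefore $\Delta_1(\mathcal{X}_n) \geq \max \|\varphi(\mathcal{X}_n)\|_1 - \min \|\varphi(\mathcal{X}_n)\|_1$, which is again in $\Theta(n^3)$. This shows that $\Delta_1(\mathcal{X}_n)$ is in $\Theta(n^3)$, as we claimed. □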
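The sandwich used in this proof, $\max\|\varphi\|_1 - \min\|\varphi\|_1 \leq \Delta_1 \leq 2\max\|\varphi\|_1$, can be observed by brute force on small binary trees. The sketch below is only an illustration under stated assumptions: trees are nested tuples, the enumeration by successive leaf insertion and all function names are introduced here, and the quadratic pairwise scan is practical only for very small $n$.

```python
def cophenetic_vector(tree, root_depth=0):
    """Map each pair (i, j), i <= j, to its cophenetic value; (i, i) holds the depth of leaf i."""
    phi = {}

    def visit(node, depth):
        if not isinstance(node, tuple):
            phi[(node, node)] = depth
            return [node]
        groups = [visit(child, depth + 1) for child in node]
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                for x in groups[a]:
                    for y in groups[b]:
                        phi[(min(x, y), max(x, y))] = depth
        return [leaf for group in groups for leaf in group]

    visit(tree, root_depth)
    return phi


def insert_leaf(tree, leaf):
    """Yield every binary tree obtained by attaching `leaf` as a sibling of some subtree."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for t in insert_leaf(left, leaf):
            yield (t, right)
        for t in insert_leaf(right, leaf):
            yield (left, t)


def all_binary_trees(n):
    """All rooted binary phylogenetic trees on leaves 1..n; there are (2n - 3)!! of them."""
    trees = [1]
    for leaf in range(2, n + 1):
        trees = [t for base in trees for t in insert_leaf(base, leaf)]
    return trees


def l1_stats(n):
    """Return (min norm, max norm, diameter) of the L1 cophenetic metric on binary trees."""
    vectors = [cophenetic_vector(t) for t in all_binary_trees(n)]
    keys = sorted(vectors[0])
    rows = [[v[key] for key in keys] for v in vectors]
    norms = [sum(row) for row in rows]
    diameter = max(sum(abs(x - y) for x, y in zip(r1, r2)) for r1 in rows for r2 in rows)
    return min(norms), max(norms), diameter


lo, hi, diam = l1_stats(5)
print(lo, hi, diam)
```

For $n = 5$ the extreme norms are those of the maximally balanced tree and of the caterpillar, in agreement with the values of $S$ and $\Phi$ recalled above.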

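Let us consider now the case $p > 1$. Since, for every $x \in \mathbb{R}^m$, $\|x\|_1 \leq m^{1-\frac{1}{p}}\|x\|_p$, we have that, for every pair of trees $T_1,T_2 \in \mathcal{X}_n$,

$$d_{\varphi,1}(T_1,T_2) \leq \binom{n+1}{2}^{1-\frac{1}{p}} d_{\varphi,p}(T_1,T_2),$$

and therefore

$$\Delta_1(\mathcal{X}_n) \leq \binom{n+1}{2}^{1-\frac{1}{p}} \Delta_p(\mathcal{X}_n),$$

from where we deduce, using that $\Delta_1(\mathcal{X}_n)$ is in $\Theta(n^3)$, that

$$\Delta_p(\mathcal{X}_n) \geq \Omega\big(n^{3-2(1-\frac{1}{p})}\big) = \Omega\big(n^{(p+2)/p}\big).$$

To prove the converse inequality, let

$$\varphi^{(p)}(T) = \|\varphi(T)\|_p^p = \sum_{1 \leq i \leq j \leq n} \varphi_T(i,j)^p.$$

We have that, for every $T_1,T_2 \in \mathcal{X}_n$,

$$d_{\varphi,p}(T_1,T_2) = \|\varphi(T_1) - \varphi(T_2)\|_p \leq \|\varphi(T_1)\|_p + \|\varphi(T_2)\|_p = \sqrt[p]{\varphi^{(p)}(T_1)} + \sqrt[p]{\varphi^{(p)}(T_2)} \leq 2\sqrt[p]{\max \varphi^{(p)}(\mathcal{X}_n)},$$

which implies that $\Delta_p(\mathcal{X}_n) \leq 2\sqrt[p]{\max \varphi^{(p)}(\mathcal{X}_n)}$. Therefore, to prove that the diameter of $d_{\varphi,p}$ on each $\mathcal{X}_n$ is bounded from above by $O(n^{(p+2)/p})$, it is enough to prove that $\max \varphi^{(p)}(\mathcal{X}_n) \leq O(n^{p+2})$. We do it in the next lemma.

Lemma 16. The maximum value of $\varphi^{(p)}(T)$ on $\mathcal{UT}_n$, $\mathcal{T}_n$ or $\mathcal{BT}_n$ is reached at the rooted caterpillars, and its value is in $\Theta(n^{p+2})$.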




Proof. Arguing as in the case $p = 1$, we have that the maximum value of $\varphi^{(p)}(T)$ on $\mathcal{UT}_n$ is reached on trees in $\mathcal{T}_n$, because if we replace each nested taxon in a tree by a new leaf labeled with the same taxon as in Fig. 24, the value of $\varphi^{(p)}$ increases. On the other hand, if a tree $T \in \mathcal{T}_n$ contains a node with $k \geq 3$ children, as in the left hand side of Fig. 25, and we replace its subtree rooted at this node as described in the right hand side of Fig. 25, we obtain a new tree $T' \in \mathcal{T}_n$ with larger $\varphi^{(p)}$ value: the values of $\varphi(i,j)^p$ for $i,j \in L(T_1) \cup \cdots \cup L(T_{k-1})$ increase, and the other values of $\varphi(i,j)^p$ do not change. This implies that for every non-binary phylogenetic tree $T \in \mathcal{T}_n$, there always exists a binary phylogenetic tree $T' \in \mathcal{BT}_n$ such that $\varphi^{(p)}(T') > \varphi^{(p)}(T)$, and in particular that the maximum value of $\varphi^{(p)}(T)$ on $\mathcal{UT}_n$ is actually reached on $\mathcal{BT}_n$.




Figure 25: $\varphi^{(p)}(T') > \varphi^{(p)}(T)$.

Figure 26: $\varphi^{(p)}(T') > \varphi^{(p)}(T)$.



Let now $T \in \mathcal{BT}_n$ and assume that it is not a caterpillar. Therefore, it has an internal node $z$ of largest depth without any leaf child; in particular, all internal descendant nodes of $z$ have some leaf child. Thus, and up to a relabeling of its leaves, $T$ has the form represented in the left hand side of Fig. 26, for some $k \geq 2$ and some $l \geq k+2$. Consider then the tree $T'$ depicted in the right hand side of Fig. 26, where the grey triangle represents the same tree in both sides. It turns out that $\varphi^{(p)}(T') - \varphi^{(p)}(T) > 0$. Indeed, if $q$ denotes the depth of the node $z$ in both trees, then

$$\varphi_{T'}(i,j)^p - \varphi_T(i,j)^p = \begin{cases}
(q+i)^p - (q+i+1)^p & \text{if } 1 \leq i = j \leq k-1 \\
(q+i)^p - (q+i-k+1)^p & \text{if } k+1 \leq i = j \leq l-1 \\
(q+l-1)^p - (q+l-k)^p & \text{if } i = j = l \\
(q+i-1)^p - (q+i)^p & \text{if } 1 \leq i < j \leq k \\
(q+i-1)^p - (q+i-k)^p & \text{if } k+1 \leq i < j \leq l \\
(q+i-1)^p - q^p & \text{if } 1 \leq i \leq k < j \leq l \\
0 & \text{otherwise}
\end{cases}$$

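Therefore,

$$\begin{aligned}
\varphi^{(p)}(T') - \varphi^{(p)}(T) ={}& \sum_{i=1}^{k-1} \big((q+i)^p - (q+i+1)^p\big) + \sum_{i=k+1}^{l-1} \big((q+i)^p - (q+i-k+1)^p\big) \\
&+ (q+l-1)^p - (q+l-k)^p + \sum_{i=1}^{k-1} (k-i)\big((q+i-1)^p - (q+i)^p\big) \\
&+ \sum_{i=k+1}^{l-1} (l-i)\big((q+i-1)^p - (q+i-k)^p\big) + \sum_{i=1}^{k} (l-k)\big((q+i-1)^p - q^p\big) \\
={}& (q+1)^p - (q+k)^p + \sum_{i=1}^{l-k-1} \big((q+k+i)^p - (q+i+1)^p\big) \\
&+ (q+l-1)^p - (q+l-k)^p + \sum_{i=1}^{k-1} (k-i)\big((q+i-1)^p - (q+i)^p\big) \\
&+ \sum_{i=1}^{l-k-1} (l-k-i)\big((q+k+i-1)^p - (q+i)^p\big) + \sum_{i=1}^{k} (l-k)\big((q+i-1)^p - q^p\big).
\end{aligned}$$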

To prove that this sum is positive, let us write it as

$$\varphi^{(p)}(T') - \varphi^{(p)}(T) = S_1 + S_2 + S_3,$$

where

$$\begin{aligned}
S_1 &= \sum_{i=1}^{k-1} (k-i)\big((q+i-1)^p - (q+i)^p\big) + \sum_{i=1}^{k} (l-k)\big((q+i-1)^p - q^p\big), \\
S_2 &= \sum_{i=1}^{l-k-1} \big((q+k+i)^p - (q+1+i)^p\big) + \sum_{i=1}^{l-k-1} (l-k-i)\big((q+k+i-1)^p - (q+i)^p\big), \\
S_3 &= (q+1)^p - (q+k)^p + (q+l-1)^p - (q+l-k)^p.
\end{aligned}$$

Then

$$\begin{aligned}
S_1 &= \sum_{i=1}^{k-1} (k-i)\big((q+i-1)^p - (q+i)^p\big) + \sum_{i=1}^{k} (l-k)\big((q+i-1)^p - q^p\big) \\
&= \sum_{i=1}^{k-1} (k-i)(q+i-1)^p - \sum_{i=1}^{k-1} (k-i)(q+i)^p + \sum_{i=1}^{k} (l-k)\big((q+i-1)^p - q^p\big) \\
&= \sum_{i=1}^{k-1} (k-i)(q+i-1)^p - \sum_{i=2}^{k} (k-i+1)(q+i-1)^p + (l-k)\sum_{i=1}^{k} (q+i-1)^p - k(l-k)q^p \\
&= \sum_{i=1}^{k-1} (l-k-1)(q+i-1)^p + kq^p - (q+k-1)^p + (l-k)(q+k-1)^p - k(l-k)q^p \\
&= (l-k-1)\sum_{i=1}^{k} \big((q+i-1)^p - q^p\big) > 0,
\end{aligned}$$

$$\begin{aligned}
S_2 &= \sum_{i=1}^{l-k-1} \big((q+k+i)^p - (q+1+i)^p\big) + \sum_{i=1}^{l-k-1} (l-k-i)\big((q+k+i-1)^p - (q+i)^p\big) \\
&= \sum_{i=1}^{l-k-1} \big((q+k+i)^p - (q+1+i)^p\big) + \sum_{i=0}^{l-k-1} (l-k-i-1)\big((q+k+i)^p - (q+i+1)^p\big) \\
&= \sum_{i=1}^{l-k-1} (l-k-i)\big((q+k+i)^p - (q+1+i)^p\big) + (l-k-1)\big((q+k)^p - (q+1)^p\big) \\
&\geq (l-k-1)\big((q+k)^p - (q+1)^p\big),
\end{aligned}$$

and therefore

$$\begin{aligned}
\varphi^{(p)}(T') - \varphi^{(p)}(T) &= S_1 + S_2 + S_3 \\
&> (l-k-1)\big((q+k)^p - (q+1)^p\big) + (q+1)^p - (q+k)^p + (q+l-1)^p - (q+l-k)^p \\
&= (l-k-2)\big((q+k)^p - (q+1)^p\big) + (q+l-1)^p - (q+l-k)^p > 0.
\end{aligned}$$

This implies that no tree other than a rooted caterpillar can have the largest $\varphi^{(p)}$ value in $\mathcal{BT}_n$, and hence also in $\mathcal{T}_n$ and $\mathcal{UT}_n$.
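The gain obtained when the two caterpillars hanging from $z$ are merged into a single one (Fig. 26) can be cross-checked numerically. The sketch below is an illustration under assumptions made here: only the subtree rooted at $z$ is built, as nested tuples with $z$ placed at depth $q$ (the rest of the tree contributes equally to both sides), and the function names are hypothetical. It compares the gain computed from explicit trees with the simplified forms of $S_1$, $S_2$ and $S_3$.

```python
def cophenetic_vector(tree, root_depth=0):
    """Map each pair (i, j), i <= j, to its cophenetic value; (i, i) holds the depth of leaf i."""
    phi = {}

    def visit(node, depth):
        if not isinstance(node, tuple):
            phi[(node, node)] = depth
            return [node]
        groups = [visit(child, depth + 1) for child in node]
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                for x in groups[a]:
                    for y in groups[b]:
                        phi[(min(x, y), max(x, y))] = depth
        return [leaf for group in groups for leaf in group]

    visit(tree, root_depth)
    return phi


def chain(leaves):
    """Caterpillar whose first listed leaf hangs from its root and whose last two form the cherry."""
    tree = leaves[-1]
    for leaf in reversed(leaves[:-1]):
        tree = (leaf, tree)
    return tree


def phi_p(tree, p, q):
    """Sum of the p-th powers of the cophenetic values when the root sits at depth q."""
    return sum(v ** p for v in cophenetic_vector(tree, root_depth=q).values())


def gain_direct(q, k, l, p):
    """phi^(p)(T') - phi^(p)(T), restricted to the leaves 1..l below z."""
    t = (chain(list(range(1, k + 1))), chain(list(range(k + 1, l + 1))))
    t_prime = chain(list(range(1, l + 1)))
    return phi_p(t_prime, p, q) - phi_p(t, p, q)


def gain_closed(q, k, l, p):
    """S_1 + S_2 + S_3 in their simplified forms."""
    s1 = (l - k - 1) * sum((q + i - 1) ** p - q ** p for i in range(1, k + 1))
    s2 = sum((l - k - i) * ((q + k + i) ** p - (q + 1 + i) ** p) for i in range(1, l - k))
    s2 += (l - k - 1) * ((q + k) ** p - (q + 1) ** p)
    s3 = (q + 1) ** p - (q + k) ** p + (q + l - 1) ** p - (q + l - k) ** p
    return s1 + s2 + s3


print(gain_direct(0, 2, 4, 1), gain_closed(0, 2, 4, 1))
```

With integer $p$ the comparison is exact, so any discrepancy between the two functions would signal an error in the case analysis.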



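Finally, if $K_n$ denotes the rooted caterpillar with $n$ leaves in Fig. 23(c),

$$\varphi_{K_n}(i,j)^p = \begin{cases}
(n-1)^p & \text{if } i = j = 1 \\
(n-i+1)^p & \text{if } 2 \leq i = j \leq n \\
(n-j)^p & \text{if } 1 \leq i < j \leq n
\end{cases}$$

and thus

$$\begin{aligned}
\varphi^{(p)}(K_n) &= (n-2) \cdot 1^p + (n-3) \cdot 2^p + \cdots + 2 \cdot (n-3)^p + 1 \cdot (n-2)^p \\
&\quad + 1^p + 2^p + \cdots + (n-2)^p + (n-1)^p + (n-1)^p \\
&= (n-1) \cdot 1^p + (n-2) \cdot 2^p + \cdots + 3 \cdot (n-3)^p + 2 \cdot (n-2)^p + (n-1)^p + (n-1)^p \\
&= \sum_{k=1}^{n-1} (n-k) \cdot k^p + (n-1)^p.
\end{aligned}$$

Now, it turns out that

$$\sum_{k=1}^{n-1} k^m = \frac{1}{m+1} n^{m+1} + O(n^m). \qquad (1)$$

This property is well known for natural numbers $m \in \mathbb{N}$ [36]. For arbitrary real numbers $m > 0$, it derives from the fact that, for every $k \geq 1$,

$$\int_k^{k+1} (x-1)^m\,dx \leq k^m \leq \int_k^{k+1} x^m\,dx,$$

and then

$$\frac{1}{m+1}(n-1)^{m+1} = \int_1^n (x-1)^m\,dx \leq \sum_{k=1}^{n-1} k^m \leq \int_1^n x^m\,dx \leq \frac{1}{m+1} n^{m+1},$$

where both bounds are of the form $\frac{1}{m+1} n^{m+1} + O(n^m)$.

So, by identity (1), we have that

$$\sum_{k=1}^{n-1} (n-k) \cdot k^p + (n-1)^p = n\sum_{k=1}^{n-1} k^p - \sum_{k=1}^{n-1} k^{p+1} + (n-1)^p = \frac{n^{p+2}}{p+1} - \frac{n^{p+2}}{p+2} + O(n^{p+1}) = \frac{n^{p+2}}{(p+1)(p+2)} + O(n^{p+1}),$$

and hence $\varphi^{(p)}(K_n)$ is in $\Theta(n^{p+2})$.

□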
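The closed expression for $\varphi^{(p)}(K_n)$ and its leading term can be checked numerically. The following sketch is illustrative: it builds the caterpillar as nested tuples (an encoding assumed here, with hypothetical function names), compares the direct sum of $p$-th powers with $\sum_{k=1}^{n-1}(n-k)k^p + (n-1)^p$, and reports the ratio to $n^{p+2}/((p+1)(p+2))$ for a larger $n$.

```python
def cophenetic_vector(tree, root_depth=0):
    """Map each pair (i, j), i <= j, to its cophenetic value; (i, i) holds the depth of leaf i."""
    phi = {}

    def visit(node, depth):
        if not isinstance(node, tuple):
            phi[(node, node)] = depth
            return [node]
        groups = [visit(child, depth + 1) for child in node]
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                for x in groups[a]:
                    for y in groups[b]:
                        phi[(min(x, y), max(x, y))] = depth
        return [leaf for group in groups for leaf in group]

    visit(tree, root_depth)
    return phi


def caterpillar(n):
    """Rooted caterpillar in which leaves 1 and 2 form the cherry and leaf n hangs from the root."""
    tree = (1, 2)
    for leaf in range(3, n + 1):
        tree = (tree, leaf)
    return tree


def phi_p_direct(n, p):
    """Sum of the p-th powers of all entries of the cophenetic vector of K_n."""
    return sum(v ** p for v in cophenetic_vector(caterpillar(n)).values())


def phi_p_closed(n, p):
    """Closed expression: sum over k of (n - k) * k^p, plus (n - 1)^p."""
    return sum((n - k) * k ** p for k in range(1, n)) + (n - 1) ** p


def leading_term(n, p):
    return n ** (p + 2) / ((p + 1) * (p + 2))


for p in (1, 1.5, 2, 3):
    print(p, phi_p_closed(2000, p) / leading_term(2000, p))
```

The printed ratios should be close to 1, which is the numerical counterpart of $\varphi^{(p)}(K_n)$ being in $\Theta(n^{p+2})$.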



Therefore, $\Omega(n^{(p+2)/p}) \leq \Delta_p(\mathcal{X}_n) \leq O(n^{(p+2)/p})$, which shows that the diameter of $d_{\varphi,p}$ on $\mathcal{UT}_n$, $\mathcal{T}_n$ and $\mathcal{BT}_n$ is indeed in $\Theta(n^{(p+2)/p})$.

We finally prove the case $p = 0$, which needs a completely different argument.





Figure 27: The caterpillars $K$ and $K'$ used in the proof of Lemma 17.



Lemma 17. The diameter of $d_{\varphi,0}$ on $\mathcal{UT}_n$, $\mathcal{T}_n$ and $\mathcal{BT}_n$ is in $\Theta(n^2)$.

Proof. Since the cophenetic vector of a tree $T \in \mathcal{UT}_n$ lies in $\mathbb{R}^{n(n+1)/2}$, it is clear that $d_{\varphi,0}(T_1,T_2) \leq n(n+1)/2$ for every $T_1,T_2 \in \mathcal{UT}_n$. Now, consider the pair of rooted caterpillars $K$, $K'$ with $n$ leaves depicted in Fig. 27. We have that

$$\begin{aligned}
\varphi_K(i,j) &= n - j, & \varphi_{K'}(i,j) &= i - 1 && \text{for every } 1 \leq i < j \leq n, \\
\varphi_K(i,i) &= n - i + 1, & \varphi_{K'}(i,i) &= i && \text{for every } 2 \leq i \leq n-1, \\
\varphi_K(1,1) &= n - 1, & \varphi_{K'}(1,1) &= 1, \\
\varphi_K(n,n) &= 1, & \varphi_{K'}(n,n) &= n - 1.
\end{aligned}$$

This shows that the number of pairs $(i,j)$, $1 \leq i \leq j \leq n$, such that $\varphi_K(i,j) = \varphi_{K'}(i,j)$ is at most $(n+1)/2$, and therefore that $d_{\varphi,0}(K,K')$ is at least $(n^2-1)/2$. So, the diameter of $d_{\varphi,0}$ on $\mathcal{UT}_n$ is bounded from above by $O(n^2)$, and its diameter on $\mathcal{BT}_n$ is bounded from below by $\Omega(n^2)$, which implies that the diameter of $d_{\varphi,0}$ on $\mathcal{UT}_n$, $\mathcal{T}_n$ and $\mathcal{BT}_n$ is in $\Theta(n^2)$. □
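The count of coinciding entries for the two caterpillars can be reproduced directly. This is a minimal sketch under assumptions made here: $K$ and $K'$ are encoded as nested tuples with opposite leaf orders, chosen so that the cophenetic values match those listed in the proof, the function names are hypothetical, and $n \geq 3$ is assumed (for $n = 2$ both caterpillars coincide).

```python
def cophenetic_vector(tree, root_depth=0):
    """Map each pair (i, j), i <= j, to its cophenetic value; (i, i) holds the depth of leaf i."""
    phi = {}

    def visit(node, depth):
        if not isinstance(node, tuple):
            phi[(node, node)] = depth
            return [node]
        groups = [visit(child, depth + 1) for child in node]
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                for x in groups[a]:
                    for y in groups[b]:
                        phi[(min(x, y), max(x, y))] = depth
        return [leaf for group in groups for leaf in group]

    visit(tree, root_depth)
    return phi


def caterpillar(order):
    """Caterpillar whose cherry holds the first two labels of `order`; the last label hangs from the root."""
    tree = (order[0], order[1])
    for leaf in order[2:]:
        tree = (tree, leaf)
    return tree


def agreements_and_distance(n):
    """Return (number of coinciding entries, d_0(K, K')) for the two caterpillars on n leaves."""
    k = cophenetic_vector(caterpillar(list(range(1, n + 1))))       # leaf n next to the root
    k_prime = cophenetic_vector(caterpillar(list(range(n, 0, -1)))) # leaf 1 next to the root
    equal = sum(1 for key in k if k[key] == k_prime[key])
    return equal, len(k) - equal


for n in (3, 6, 9):
    print(n, agreements_and_distance(n))
```

The number of coinciding entries grows linearly in $n$ while the vector has $n(n+1)/2$ entries, which is what forces the quadratic lower bound.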



